DOCUMENT RESUME

ED 238 940                                              TM 840 042

AUTHOR        Wilcox, Rand R.
TITLE         Optimal Measurement Considerations for Diagnostic
              Tests. Methodology Project.
INSTITUTION   California Univ., Los Angeles. Center for the Study
              of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, DC.
PUB DATE      Nov 83
GRANT         NIE-G-83-0001
NOTE          108p.
PUB TYPE      Collected Works - General (020) -- Reports -
              Research/Technical (143)
EDRS PRICE    MF01/PC05 Plus Postage.
DESCRIPTORS   *Diagnostic Tests; *Estimation (Mathematics);
              Guessing (Tests); *Latent Trait Theory; *Measurement
              Techniques; *Multivariate Analysis; Scoring; Testing
              Problems; Test Items; *True Scores
IDENTIFIERS   Linear Measurement

ABSTRACT 

This document presents a series of five papers describing issues in
educational measurement. "A Simple Model for Diagnostic Testing When
There Are Several Types of Misinformation" directly addresses the
diagnostic issue. It describes a simple latent trait model for testing,
examines use of erroneous algorithms, and illustrates the derivation of
an optimal scoring rule for multiple choice test items. "Measuring
Mental Abilities with Latent State Models" has three goals: to review
the latent state models that have been proposed for measuring aptitude
and achievement; to outline the measurement problems that can now be
solved with latent state models; and to discuss how latent state and
latent trait models are related. "Strong True Score Theory" reviews
true score models in light of various assumptions about guessing.
"Approximating Multivariate Distributions" suggests a simple
approximation of multivariate distributions. The suggested method is
compared with several other approximations. These comparisons indicate
that the new approximation nearly always gives better results.
"Unbiased Estimation in a Closed Sequential Testing Procedure" provides
an optimal linear estimator of the proportion of items within an item
domain that an examinee would answer correctly if every item were
attempted. (Author/PN)



***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made      *
*                   from the original document.                       *
***********************************************************************



Deliverable - November 1983
METHODOLOGY PROJECT



OPTIMAL MEASUREMENT CONSIDERATIONS 
FOR DIAGNOSTIC TESTS 



Rand R. Wilcox



Grant Number
NIE-G-83-0001



Center for the Study of Evaluation
UCLA Graduate School of Education
Los Angeles, California



November 1983



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not necessarily
represent official NIE position or policy.



The project presented or reported herein was performed pursuant to a
grant from the National Institute of Education, Department of
Education. However, the opinions expressed herein do not necessarily
reflect the position or policy of the National Institute of Education,
and no official endorsement by the National Institute of Education
should be inferred.



PREFACE

This document presents a series of papers describing issues in
educational measurement. The first paper, "A Simple Model for
Diagnostic Testing When There Are Several Types of Misinformation,"
directly addresses the diagnostic issue. It describes a simple latent
trait model for testing, examines use of erroneous algorithms, and
illustrates the derivation of an optimal scoring rule for multiple
choice test items.

The second paper, "Measuring Mental Abilities with Latent State
Models," has three goals: 1) to review the latent state models that
have been proposed for measuring aptitude and achievement; 2) to
outline the measurement problems that can now be solved with latent
state models; and 3) to discuss how latent state and latent trait
models are related.

The third paper, "Strong True Score Theory," reviews true score models
in light of various assumptions about guessing. It is an invited paper
to appear in an encyclopedia for statistics.

The fourth paper, "Approximating Multivariate Distributions," suggests
a simple approximation of multivariate distributions. The suggested
method is compared with several other approximations. These comparisons
indicate that the new approximation nearly always gives better results.

The final paper, "Unbiased Estimation in a Closed Sequential Testing
Procedure," provides an optimal linear estimator of the proportion of
items within an item domain that an examinee would answer correctly if
every item were attempted.






A SIMPLE MODEL FOR DIAGNOSTIC TESTING WHEN
THERE ARE SEVERAL TYPES OF MISINFORMATION



Rand R. Wilcox
Department of Psychology
University of Southern California 

and 

Center for the Study of Evaluation 
University of California, Los Angeles 



ABSTRACT

In diagnostic testing one purpose of a test might be to determine
whether an examinee has acquired the appropriate skills for solving
certain types of problems, or whether the examinee is using an
erroneous algorithm. In the latter case it is also desired to determine
which of several erroneous algorithms is being used so that remedial
training can be given. Birenbaum and Tatsuoka (1982) recently
illustrated that when testing eighth graders on the addition of signed
numbers, examinees might indeed be applying one of several erroneous
algorithms, and more recently they reported results on a scoring
procedure for this situation. This paper describes a simple latent
class model for handling the items in Birenbaum and Tatsuoka; included
is a description and illustration of how to derive the optimal scoring
rule when multiple choice test items are used.



Birenbaum and Tatsuoka (1982) provide an interesting example of the
need to measure and classify examinees according to the type of
misinformation they have relative to a particular skill. They were
specifically concerned with testing the addition of signed numbers, but
it is evident that similar problems occur in many situations. As
Birenbaum and Tatsuoka point out, examinees might be using one of
several erroneous algorithms when responding to these items. They
described three algorithms that were actually used by examinees, and
since they play an important role here, they are briefly reviewed.

The first erroneous algorithm was treating parentheses as meaning
absolute value. Thus 7+(-3) would result in an answer of 10. The second
algorithm was to add the two numbers (ignoring their signs) and take
the sign of the number having the largest absolute value. For example,
if asked to compute 3 + -7, the examinee adds 3 and 7, and because 7>3,
a negative sign is added yielding -10. The third erroneous algorithm
was to add the two numbers when they had different signs, and to put a
plus sign in the result. For example, 3 + -7 = 10 according to this
rule. If the two numbers have the same sign, the student takes their
difference and puts the common sign in the result. For example,
(-8)+(-4) = -4. This last algorithm resulted from the student
misunderstanding how to use the number line as it was explained by the
teacher. Table 1, taken from Birenbaum and Tatsuoka (1982), shows
several addition problems and the results arrived at according to the
three erroneous algorithms just described. Note that different
algorithms can yield the same answer, and in some cases even the
correct response.
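The three erroneous algorithms can be written out as a short sketch. The Python below is only an illustration of the verbal rules above, not code from Birenbaum and Tatsuoka; the function names and the encoding of each item as (first number, in parentheses?, second number, in parentheses?) are assumptions made here for the example.

```python
# Each item: (a, a_in_parentheses, b, b_in_parentheses). Hypothetical encoding
# of the five addition problems listed in Table 1.
ITEMS = [
    (3, False, -7, False),    # 3 + -7
    (7, False, -3, True),     # 7 + (-3)
    (-6, False, -15, False),  # -6 + -15
    (-6, False, 15, False),   # -6 + +15
    (-23, True, -9, True),    # (-23) + (-9)
]


def sign(x):
    """Return -1 for negative numbers and +1 otherwise."""
    return -1 if x < 0 else 1


def correct(a, a_par, b, b_par):
    return a + b


def algorithm_1(a, a_par, b, b_par):
    # Parentheses are read as absolute value; the sum is otherwise correct.
    a = abs(a) if a_par else a
    b = abs(b) if b_par else b
    return a + b


def algorithm_2(a, a_par, b, b_par):
    # Add the magnitudes, then attach the sign of the larger magnitude.
    larger = a if abs(a) >= abs(b) else b
    return sign(larger) * (abs(a) + abs(b))


def algorithm_3(a, a_par, b, b_par):
    # Different signs: add magnitudes and attach a plus sign.
    if sign(a) != sign(b):
        return abs(a) + abs(b)
    # Same sign: take the difference of magnitudes, keep the common sign.
    return sign(a) * abs(abs(a) - abs(b))


def table_1():
    """Rows of (correct answer, algorithm 1, algorithm 2, algorithm 3)."""
    return [
        (correct(*it), algorithm_1(*it), algorithm_2(*it), algorithm_3(*it))
        for it in ITEMS
    ]
```

Calling `table_1()` gives one row per problem, which makes it easy to see where two algorithms agree with each other or with the correct answer.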



Birenbaum and Tatsuoka (1982, 1983) argue for the need to measure
misinformation and to determine the type of misinformation that a
student has. In their more recent article (Birenbaum & Tatsuoka, 1983),
they compared two scoring algorithms for measuring misinformation, but
no results were given on determining the accuracy of either procedure,
and indeed neither procedure was developed with the goal of finding the
optimal scoring procedure for identifying whether an erroneous
algorithm is being used. (They compared coefficient alpha for the two
scoring procedures, but this is not a direct measure of the accuracy of
the test as it is defined below.)

The goal in this paper is to illustrate how an optimal scoring
procedure can be derived for the situation considered in Birenbaum and
Tatsuoka (1982). As will become evident, the process used for
determining the optimal scoring rule can be easily extended to other
situations, but to keep the illustration as simple as possible,
attention will be restricted to the items in Table 1. An additional
advantage of the results to be given is that expressions are also
derived for the probability of correctly determining the algorithm
being used by an examinee.

Before continuing, some comments should be made regarding results
similar to the developments made here. First, the problem being
examined is similar to one considered by Macready and Dayton (1977). It
is easily seen though that the latent structure model they used is
inappropriate for the problem at hand. Wilcox (1982a) proposed a model
for measuring misinformation via an answer-until-correct scoring
procedure, but this model is inadequate here as well. The reason is
that his model can measure only one type of misinformation, and here
the problem is contending with three erroneous algorithms. Dayton and
Macready (1980) as well as Goodman (1974) describe very general latent
class models that could be applied, and Bergan et al. (1980) described
an appropriate scoring procedure. However, these models require
iterative techniques that may be unnecessarily complicated. In
particular, Dayton and Macready's model requires iterative
approximations of the maximum likelihood estimates of the parameters,
and for theoretical reasons it is best to avoid these estimation
techniques whenever possible (Kale, 1962a; 1962b). The problem is
determining whether iterative estimation procedures converge to the
maximum likelihood estimates that they are intended to approximate. It
appears that they usually do, but there is no guarantee that this will
always be the case. (For a situation where iterative techniques can
converge to inappropriate values, see Wilcox, 1979.) Thus, an important
aspect of this paper is that by making certain assumptions about how
examinees behave when taking test items, which are motivated by a
published empirical study described below, a relatively simple model
results where explicit maximum likelihood estimates of the parameters
are available, and these estimates can be used to solve the measurement
problems described above.



2. The Model and Its Assumptions

It is assumed that multiple-choice test items are used, and that every
item has t alternatives. This last assumption is made primarily for
notational convenience. Using multiple choice items introduces the
problem of guessing, but this seems to be easier to handle, from a
statistical point of view, than is the problem of careless errors,
which is one of the erroneous algorithms also considered by Birenbaum
and Tatsuoka (1982). Here it is assumed that careless errors occur with
probability close to zero so that for practical purposes this error can
be ignored. As explained in the introduction, only the three erroneous
algorithms in Birenbaum and Tatsuoka will be considered, plus, of
course, the algorithm of random guessing. Thus, for the population of
examinees to be tested, it is assumed that every examinee belongs to
one of five mutually exclusive latent states: they know how to solve
the items, they guess at random, or they apply one of the three
incorrect algorithms described above. It is also assumed that if an
examinee is using the correct algorithm, the correct response is always
chosen, and if one of the three erroneous algorithms is used, an
examinee will always choose a corresponding response. For comments
about this last assumption, see section 6. For the moment it is also
assumed that every item has a distractor that is consistent with each
of the erroneous algorithms. This restriction could be relaxed, if
desired, when applying the procedure outlined in section 5. Another
assumption is that there are no examinees who have partial information.
Although empirical results indicate that partial information exists in
some situations (e.g., Coombs et al., 1956), there is also some
empirical evidence that when dealing with misinformation, it may be
reasonable to assume that no examinees have partial information
(Wilcox, 1982a). It is not being suggested that this assumption be
taken for granted, only that it might be reasonable in practice;
section 4 discusses how certain implications of the model can be
tested, and this test should always be carried out.

The next step is to find n items that make it possible to distinguish
between any two examinees having a different erroneous algorithm. In
addition, these items should include at least one item that will result
in at least one incorrect response when an erroneous algorithm is being
used. These last two conditions are clearly satisfied for the items in
Table 1. In fact, only the first three items are needed.

Let 1 and 0 represent a correct and incorrect response to an item,
respectively. Consider an examinee responding to the first three items
in Table 1. If the first erroneous algorithm is used, the resulting
response pattern will be (1,0,1). For the second erroneous algorithm,
the response pattern will be (0,0,1), and the third erroneous algorithm
will give (0,0,0). Thus, if an examinee has response pattern (1,0,1),
for example, the assumptions of the model rule out the possibility that
the examinee is using one of the other two erroneous algorithms, and so
the examinee is either applying the first erroneous algorithm or is
guessing at random.
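The response patterns just listed can be checked mechanically. The following minimal sketch assumes the answers for items 1 to 3 as they appear in Table 1; the dictionary and function names are hypothetical and are introduced only for this illustration.

```python
# Correct answers to items 1-3 of Table 1.
CORRECT = [-4, 4, -21]

# Answers produced by each erroneous algorithm on items 1-3 (from Table 1).
ANSWERS = {
    "first": [-4, 10, -21],
    "second": [-10, 10, -21],
    "third": [10, 10, -9],
}


def pattern(answers, key=CORRECT):
    """Score each answer 1 (correct) or 0 (incorrect) against the key."""
    return tuple(int(a == k) for a, k in zip(answers, key))


# Pattern of corrects and incorrects implied by each erroneous algorithm.
PATTERNS = {name: pattern(ans) for name, ans in ANSWERS.items()}

# The three items separate the algorithms only if the patterns all differ.
DISTINGUISHABLE = len(set(PATTERNS.values())) == len(PATTERNS)
```

Because the three patterns are distinct and each contains at least one 0, the two conditions stated above hold for these items.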

Two practical problems will be considered. The first is estimating the
proportions of examinees among a population of examinees who belong to
the various latent states. The second problem is distinguishing between
those examinees who are guessing at random and those belonging to one
of the four other latent classes. As will become evident, a solution to
the first problem can be useful when solving the second.



Let $\zeta$ be the proportion of examinees who know the correct
algorithm, and let $\zeta_i$ ($i=1,2,3$) be the proportion who are
using the ith erroneous algorithm. Finally, let $\zeta_4$ be the
proportion of examinees who guess at random, and let $p_{ijk}$
($i=0,1$; $j=0,1$; $k=0,1$) be the probability that a randomly sampled
examinee would give the response pattern $(i,j,k)$. For example,
$p_{101}$ is the probability that a randomly sampled examinee would
give a correct, incorrect and correct response to the first three items
in Table 1. From the assumptions already described, it follows that

$$p_{111} = \zeta + \zeta_4 (1/t)^3 \tag{2.1}$$

$$p_{101} = \zeta_4 (1/t)^2 (1-1/t) + \zeta_1 \tag{2.2}$$

$$p_{001} = \zeta_4 (1/t)(1-1/t)^2 + \zeta_2 \tag{2.3}$$

$$p_{000} = \zeta_4 (1-1/t)^3 + \zeta_3 \tag{2.4}$$

$$p_{110} = p_{011} = \zeta_4 (1/t)^2 (1-1/t) \tag{2.5a}$$

$$p_{100} = p_{010} = \zeta_4 (1/t)(1-1/t)^2 \tag{2.5b}$$

For N randomly sampled examinees, let $x_{ijk}$ be the number of
examinees having response pattern $(i,j,k)$, let $q_1$ be the common
value of $p_{110}$ and $p_{011}$, and let $q_2$ be the common value of
$p_{100}$ and $p_{010}$. From standard results on the multinomial
distribution in conjunction with results in Zehna (1966),
$\hat{q}_1 = (x_{110}+x_{011})/2N$ and
$\hat{q}_2 = (x_{100}+x_{010})/2N$ are maximum likelihood estimates of
$q_1$ and $q_2$. It follows that maximum likelihood estimates of
$\zeta_4$, $\zeta$, $\zeta_1$, $\zeta_2$ and $\zeta_3$ are

$$\hat{\zeta}_4 = \left( \hat{q}_1 (1/t)^{-2} (1-1/t)^{-1} + \hat{q}_2 (1/t)^{-1} (1-1/t)^{-2} \right) / 2 \tag{2.6}$$

$$\hat{\zeta} = x_{111}/N - \hat{\zeta}_4 t^{-3} \tag{2.7}$$

$$\hat{\zeta}_1 = x_{101}/N - \hat{\zeta}_4 (1-1/t)/t^2 \tag{2.8}$$

$$\hat{\zeta}_2 = x_{001}/N - \hat{\zeta}_4 (1-1/t)^2/t \tag{2.9}$$

and

$$\hat{\zeta}_3 = x_{000}/N - \hat{\zeta}_4 (1-1/t)^3 \tag{2.10}$$
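As a rough numerical check on the pattern probabilities and estimates of this section, the sketch below computes the eight probabilities $p_{ijk}$ from given proportions and then applies the explicit estimates to a table of counts. It is only an illustration under the model's assumptions; the function names, argument names and dictionary keys are choices made here, not notation from the paper.

```python
from itertools import product


def pattern_probabilities(zeta, z1, z2, z3, z4, t):
    """Probability of each correct/incorrect pattern on the first three items."""
    g = 1.0 / t  # chance that a random guess is correct
    probs = {}
    for pat in product((0, 1), repeat=3):
        k = sum(pat)  # number of correct responses in the pattern
        probs[pat] = z4 * g**k * (1 - g) ** (3 - k)
    # Examinees in the non-guessing states give one pattern with certainty.
    probs[(1, 1, 1)] += zeta
    probs[(1, 0, 1)] += z1
    probs[(0, 0, 1)] += z2
    probs[(0, 0, 0)] += z3
    return probs


def estimate(counts, t):
    """Explicit estimates of the five proportions from pattern counts."""
    n = sum(counts.values())
    g = 1.0 / t
    q1 = (counts[(1, 1, 0)] + counts[(0, 1, 1)]) / (2 * n)
    q2 = (counts[(1, 0, 0)] + counts[(0, 1, 0)]) / (2 * n)
    # Average of the two estimates of the guessing proportion.
    z4 = (q1 / (g**2 * (1 - g)) + q2 / (g * (1 - g) ** 2)) / 2
    return {
        "zeta": counts[(1, 1, 1)] / n - z4 * g**3,
        "zeta1": counts[(1, 0, 1)] / n - z4 * g**2 * (1 - g),
        "zeta2": counts[(0, 0, 1)] / n - z4 * g * (1 - g) ** 2,
        "zeta3": counts[(0, 0, 0)] / n - z4 * (1 - g) ** 3,
        "zeta4": z4,
    }
```

Feeding `estimate` the expected counts `N * p` for each pattern should return the proportions that generated them, which is a convenient sanity check before using real data. With sampled data, an estimate can fall slightly below zero; how to handle that case is not addressed in this sketch.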



Making Decisions About an Examinee's Latent State

Suppose an examinee gives the response pattern (1,1,1). Then according
to the model, the examinee is either using the correct algorithm or
choosing responses at random; the problem is determining which is true.
The simplest solution is to examine the probability of observing the
response pattern (1,1,1) if the examinee is guessing at random; this is
just $t^{-3}$ assuming the responses are independent of one another. If
$t^{-3}$ is small, it might be decided that the examinee is not
guessing, but this approach can be unsatisfactory. To illustrate why,
suppose $\zeta=0$. Then the optimal scoring rule would be to always
decide an examinee does not know, and to conclude therefore that an
examinee is guessing at random when the response pattern (1,1,1) is
given.

The question arises as to whether the optimal rule for $\zeta=0$ is
also optimal when $\zeta>0$, and if so, how far away from zero $\zeta$
can be before some other rule should be used. There is also the problem
of determining the overall accuracy of the decision rule being used. A
solution to the first problem is to decide an examinee is using the
correct algorithm if and only if the response pattern (1,1,1) is given
and

$$\zeta_4 t^{-3} < \zeta. \tag{2.11}$$

This rule is derived by noting that the joint probability of randomly
sampling an examinee who guesses at random and who gives the response
(1,1,1) is just $\zeta_4 t^{-3}$. Also, the joint probability of
sampling an examinee who knows and who gives the response pattern
(1,1,1) is $\zeta$. Thus, if $\zeta < \zeta_4 t^{-3}$, decide the
examinee is guessing at random. Optimal properties of this decision
rule (given in a more general context) are described by Copas (1974).

A similar approach can be used to derive decision rules for determining
whether an examinee is using a particular erroneous algorithm. Suppose,
for example, the response (1,0,1) is given. Then according to the
model, the examinee either is using the first erroneous algorithm, or
is guessing at random. The optimal rule is to decide the examinee is
using the erroneous algorithm if and only if

$$\zeta_1 > \zeta_4 [t^{-2}(1-t^{-1})] \tag{2.12}$$

where $t^{-2}(1-1/t)$ is the probability of the response pattern
(1,0,1) from an examinee guessing at random. Thus, this is the same
rule as (2.11) except that $\zeta$ has been replaced by $\zeta_1$ and
$t^{-3}$ has been replaced by the probability of the response pattern
(1,0,1) from an examinee guessing at random. Similar modifications are
made for the other two response patterns corresponding to the other two
erroneous algorithms. As for the response patterns corresponding to
equations (2.5a) and (2.5b), simply decide the examinee is guessing at
random.



Extensions to n Item Tests Having K Latent States

The basic process used to analyze the first three items in Table 1 is
easily extended to n item tests involving K latent states. Consider any
response pattern A where A is a vector of 1's and 0's. Let $C_k$ be the
joint probability of observing A and having an examinee in the kth
latent state, $k=1,\ldots,K$. Then decide that an examinee is in the
ith latent state if $C_i = \max_k C_k$. Another illustration is given
in section 5. As already mentioned, when trying to classify an examinee
as belonging to one of two latent states, this rule is known to have
certain optimal properties (Copas, 1974). If an examinee giving
response A can belong to more than two latent classes, the rule used
here is the same as the one used by Bergan et al. (1980), but the
optimal properties discussed by Copas (1974) have not been established.
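The maximum joint probability rule can be sketched for the three-item case as follows. This is an illustrative reading of the rule, not an implementation taken from the paper; the state labels, the `props` dictionary and the tie-breaking choice (ties go to guessing, matching the strict inequality in (2.11)) are assumptions made for the example.

```python
# Pattern given with certainty by each non-guessing latent state (items 1-3).
STATE_PATTERNS = {
    "correct": (1, 1, 1),
    "first": (1, 0, 1),
    "second": (0, 0, 1),
    "third": (0, 0, 0),
}


def joint_probabilities(pattern, props, t):
    """Joint probability of each latent state together with the pattern."""
    g = 1.0 / t
    k = sum(pattern)
    # A guesser produces any pattern; "guessing" is listed first so that
    # max() below resolves ties in favour of guessing.
    joint = {"guessing": props["guessing"] * g**k * (1 - g) ** (len(pattern) - k)}
    for state, pat in STATE_PATTERNS.items():
        joint[state] = props[state] if pat == tuple(pattern) else 0.0
    return joint


def classify(pattern, props, t):
    """Return the latent state with the largest joint probability."""
    joint = joint_probabilities(pattern, props, t)
    return max(joint, key=joint.get)
```

For instance, with `props = {"correct": 0.4, "first": 0.1, "second": 0.1, "third": 0.1, "guessing": 0.3}` and `t = 4`, the pattern (1,1,1) is assigned to the correct algorithm, while a pattern such as (1,1,0), which only a guesser can produce under the model, is assigned to guessing.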

3. The Probability of a Correct Decision

Returning to the analysis of the first three items in Table 1, suppose
the procedure in the previous section has been applied, and that a
scoring rule has been determined. The next problem is determining
whether there is a high likelihood of making a correct decision about
the latent state of a randomly sampled examinee. If this probability is
judged to be too low, the test might be modified as described below.
Again to explicate the process, only the K=5 latent states of section 2
will be considered.



Suppose the decision rule in Table 2 is to be used. Then for a randomly
sampled examinee, the probability of a correct decision (PCD) is just

$$\zeta + \zeta_1 + \zeta_2 + \zeta_3 + B\zeta_4 \tag{3.1}$$

where

$$B = 2t^{-2}(1-t^{-1}) + 2t^{-1}(1-t^{-1})^2.$$

Suppose instead that for response pattern (0,0,0) it is decided an
examinee is guessing at random. Then (3.1) becomes

$$\zeta + \zeta_1 + \zeta_2 + C\zeta_4 \tag{3.2}$$

where $C = B + (1-t^{-1})^3$, the added term being the probability that
an examinee guessing at random gives the pattern (0,0,0). Similar
adjustments can be made if the decision rule in Table 2 is modified in
any way.

The general technique in determining an expression for the PCD is to
first derive an expression for the probability that, for an examinee
guessing at random, the observed response pattern will correspond to
one where the decision is made that the examinee is indeed guessing at
random. Consider, for example, response pattern (1,1,0). Given that the
examinee is guessing at random, the probability of this response
pattern is $(t^{-1})(t^{-1})(1-t^{-1})$. Repeating this process for
every response pattern for which it is decided that an examinee is
guessing at random and adding the results yields the coefficient for
$\zeta_4$ in the expression for the PCD. For the decision rule in
Table 2, there are four such response patterns, and they add to B in
(3.1). For examinees in the other latent states, the response pattern
is determined with probability one, and so no coefficients are needed
for them in (3.1). If the PCD is judged to be too small, additional
items or more distractors can be used, and then the process described
above is applied again.
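The calculation of the PCD can be sketched as below for the three-item case. The decision rule is passed in as a dictionary from response pattern to decided state, with Table 2 as the default example; the names `TABLE_2`, `STATE_PATTERNS` and `probability_correct_decision` are hypothetical and exist only for this illustration.

```python
from itertools import product

# Pattern given with certainty by each non-guessing latent state (items 1-3).
STATE_PATTERNS = {
    "correct": (1, 1, 1),
    "first": (1, 0, 1),
    "second": (0, 0, 1),
    "third": (0, 0, 0),
}

# The decision rule of Table 2: response pattern -> decided latent state.
TABLE_2 = {
    (1, 1, 1): "correct",
    (1, 1, 0): "guessing",
    (1, 0, 1): "first",
    (0, 1, 1): "guessing",
    (1, 0, 0): "guessing",
    (0, 1, 0): "guessing",
    (0, 0, 1): "second",
    (0, 0, 0): "third",
}


def probability_correct_decision(props, rule, t):
    """PCD for a randomly sampled examinee under a given decision rule."""
    g = 1.0 / t
    # A non-guessing state is classified correctly only if the rule assigns
    # that state to the single pattern it produces.
    pcd = sum(props[s] for s, pat in STATE_PATTERNS.items() if rule[pat] == s)
    # Coefficient of the guessing proportion: chance that a guesser lands on
    # a pattern for which the rule decides "guessing".
    coefficient = sum(
        g ** sum(pat) * (1 - g) ** (3 - sum(pat))
        for pat in product((0, 1), repeat=3)
        if rule[pat] == "guessing"
    )
    return pcd + coefficient * props["guessing"]
```

Changing a single entry of the rule, for example deciding "guessing" for (0,0,0), and recomputing shows directly how the PCD responds to a modified scoring rule.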

4. Comments About Testing the Model

A partial check on the model in section 2 is to test (2.5a) and (2.5b)
with the usual sign test. In the more general case, such as in
section 5, it is necessary to test for equal cell probabilities among
several cells, and this is usually accomplished with a chi-square test.
The purpose of this section is to make some brief comments about this
well known procedure.

First, exact tests for equiprobable cells can be made when N, the
number of examinees, is less than or equal to 50 (Smith et al., 1979;
Katti, 1973). In cases where a chi-square distribution must be used to
get approximate critical values, it appears that a better approximation
of the critical values can be had by applying results in Wilcox
(1982b).

Second, a practical problem with testing for equiprobable cells is that
the null hypothesis might be rejected even when the cell probabilities
are nearly equal in value. Of course this is particularly likely to
happen when N is large. Accordingly, if the chi-square test is
significant, it would seem prudent to estimate the overall inequality
among the cell probabilities, and a detailed discussion about how this
can be done can be found in Wilcox, Cliff and Embretson (to appear).
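For readers who want to try the checks described in this section, the sketch below gives a textbook two-sided sign test for two cells that should be equally likely, and the usual chi-square statistic for several equiprobable cells. These are standard formulas, not the exact tests of Smith et al. (1979) or Katti (1973) and not the approximation in Wilcox (1982b); the function names are hypothetical, and critical values for the chi-square statistic must be looked up separately.

```python
from math import comb


def sign_test_p_value(x, y):
    """Two-sided exact p-value for H0: two cells are equally likely.

    x and y are the observed counts in the two cells, for example the
    numbers of examinees with patterns (1,1,0) and (0,1,1).
    """
    m = x + y
    if m == 0:
        return 1.0
    k = min(x, y)
    # Lower tail of a Binomial(m, 1/2) distribution, doubled by symmetry.
    tail = sum(comb(m, i) for i in range(k + 1)) / 2**m
    return min(1.0, 2 * tail)


def chi_square_equal_cells(counts):
    """Pearson chi-square statistic for H0: all cells are equiprobable."""
    m = sum(counts)
    expected = m / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

Under the null hypothesis, the statistic returned by `chi_square_equal_cells` is referred to a chi-square distribution with one fewer degree of freedom than the number of cells.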



5. An Alternative Approach

In some cases it may be useful to take into account the actual response
chosen by an examinee rather than limiting the analysis to the pattern
of correct and incorrect responses. By doing this, fewer items may be
required in order to obtain an accurate test.

As an illustration, suppose items 1 and 5 in Table 1 are to be used. If
the observed responses are -4 and -32, respectively, the examinee was
either guessing at random or was using the correct algorithm. Thus
$\Pr(-4,-32) = \zeta + \zeta_4 (t^{-1})(t^{-1})$. As a more specific
example, suppose there are t=3 alternatives for both items 1 and 5 as
in Table 1. Then

$$\xi_1 = \Pr(-4,-32) = \zeta + \zeta_4/3^2$$

where the symbol $\xi$ is introduced for notational convenience. In a
similar manner the probability of all possible response patterns can be
written in terms of the $\zeta$'s, and they are

$$\xi_2 = \Pr(-4,32) = \zeta_1 + \zeta_4/9$$

$$\xi_3 = \Pr(-4,-14) = \zeta_4/9$$

$$\xi_4 = \Pr(-10,32) = \zeta_4/9$$

$$\xi_5 = \Pr(-10,-32) = \zeta_2 + \zeta_4/9$$

$$\xi_6 = \Pr(-10,-14) = \zeta_4/9$$

$$\xi_7 = \Pr(10,32) = \zeta_4/9$$

$$\xi_8 = \Pr(10,-32) = \zeta_4/9$$

$$\xi_9 = \Pr(10,-14) = \zeta_3 + \zeta_4/9$$

Observe that $\xi_3 = \xi_4 = \xi_6 = \xi_7 = \xi_8$, and so the
comments in section 4 apply.

Let $\hat{\xi}_i$ be the usual maximum likelihood estimate of $\xi_i$.
Then the set of equations just given imply that

$$\hat{\zeta}_4 = 9(\hat{\xi}_3 + \hat{\xi}_4 + \hat{\xi}_6 + \hat{\xi}_7 + \hat{\xi}_8)/5$$

is a maximum likelihood estimate of $\zeta_4$. Hence

$$\hat{\zeta}_3 = \hat{\xi}_9 - \hat{\zeta}_4/9, \qquad \hat{\zeta}_2 = \hat{\xi}_5 - \hat{\zeta}_4/9, \qquad \hat{\zeta}_1 = \hat{\xi}_2 - \hat{\zeta}_4/9$$

and

$$\hat{\zeta} = \hat{\xi}_1 - \hat{\zeta}_4/9$$

are maximum likelihood estimates of $\zeta_3$, $\zeta_2$, $\zeta_1$ and
$\zeta$ respectively. Thus, only two items were needed to estimate the
proportion of examinees in the five latent states.
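The two-item estimates can be sketched in the same spirit. The code below assumes t=3 alternatives per item, with item 1 offering -4, -10 and 10 and item 5 offering -32, 32 and -14 as in Table 1, and it treats (10, -14) as the pattern produced by the third erroneous algorithm, which follows from Table 1 under the model. The function and variable names are hypothetical.

```python
# All nine response patterns for items 1 and 5 when t = 3.
PATTERNS = [
    (-4, -32), (-4, 32), (-4, -14),
    (-10, 32), (-10, -32), (-10, -14),
    (10, 32), (10, -32), (10, -14),
]

# Pattern given with certainty by each non-guessing latent state.
STATE_PATTERN = {
    "correct": (-4, -32),
    "first": (-4, 32),
    "second": (-10, -32),
    "third": (10, -14),
}


def estimate_two_items(counts):
    """Estimate the five latent-state proportions from pattern counts."""
    n = sum(counts.values())
    # Patterns that only a guesser can produce share one cell probability.
    guess_only = [p for p in PATTERNS if p not in STATE_PATTERN.values()]
    common = sum(counts.get(p, 0) for p in guess_only) / (len(guess_only) * n)
    z4 = 9 * common  # each pattern has probability 1/9 for a guesser
    estimates = {"guessing": z4}
    for state, pat in STATE_PATTERN.items():
        estimates[state] = counts.get(pat, 0) / n - z4 / 9
    return estimates
```

As a check, counts of 390 for (-4,-32), 120 each for (-4,32), (-10,-32) and (10,-14), and 30 for each of the remaining five patterns correspond to proportions 0.4, 0.1, 0.1, 0.1 and 0.3 for the correct, first, second, third and guessing states.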

Next suppose the $\zeta$'s are known or that they have been estimated,
and that a scoring procedure must be established. Consider, in
particular, the response pattern (-4,-32). The joint probability of
using the correct algorithm and giving the response pattern (-4,-32) is
just $\zeta$. The joint probability of guessing at random and giving
the response pattern (-4,-32) is
$\zeta_4 (t^{-1})(t^{-1}) = \zeta_4/9$. Thus, for the response pattern
(-4,-32), if $\zeta > \zeta_4/9$ decide an examinee is using the
correct algorithm. If $\zeta < \zeta_4/9$, decide the examinee is
guessing at random. The important point here is that the analysis is
basically the same as it was in section 2. Of course the other response
patterns can be analyzed in the same manner. An expression for the PCD
can also be determined once a scoring rule has been settled upon. The
details are basically the same as before, and so further comments are
omitted.

6. Concluding Remarks

In section 2 it was assumed that every item has an alternative that is
consistent with at least one of the algorithms that might be used by an
examinee. It should be noted that if computerized testing is possible,
an adaptive test could be administered that relaxes this assumption.
Suppose, for example, an item is given and that the observed response
rules out the possibility that an examinee is using the first erroneous
algorithm. Then, the next item could be chosen based on the assumption
that the examinee is not using this algorithm. That is, the distractors
need not include an alternative that is consistent with the first
erroneous algorithm. When measuring complex skills, this approach could
be important.

One of the assumptions of the model was that there is no carelessness.
That is, if an examinee is using a particular algorithm to arrive at an
answer, the alternative corresponding to this algorithm will always be
chosen. In some cases it might be necessary to include the possibility
that an examinee might carelessly choose an alternative that is
inconsistent with the algorithm being applied. The models used here are
easily extended to handle this problem, but iterative estimates of the
parameters would be needed. One way to solve this estimation problem is
to proceed as outlined in Goodman (1979). Once the parameters are
estimated, a scoring rule can be derived as was outlined above.

Another important point is that the scoring rules described here are
based on the assumption that the goal is to maximize the number of
examinees for whom a correct decision is made about their latent state.
This could mean, however, that an examinee could get an item right, and
yet it would still be concluded that an erroneous algorithm was being
used. If this possibility is objectionable, some other scoring rule
should be considered. However, the results given here are still
valuable because they yield a method of assessing the accuracy of a
test if a conventional scoring rule is applied, and the scoring rule
described here might be useful when evaluating the effectiveness of a
particular instructional program.






TABLE 1

Problems and Responses According to the Three Erroneous
Algorithms in Birenbaum and Tatsuoka (1982)

                                     Erroneous Algorithm
Problem                              1        2        3

1.  3 + -7 = -4                     -4      -10       10
2.  7 + (-3) = 4                    10       10       10
3.  -6 + -15 = -21                 -21      -21       -9
4.  -6 + +15 = 9                     9       21       21
5.  (-23) + (-9) = -32              32      -32      -14



TABLE 2

A Decision Rule for the First Three Items in Table 1

Response Pattern of
Corrects and Incorrects      Decision

111                          Uses the correct algorithm
110                          Guessing at random
101                          Uses the first erroneous algorithm
011                          Guessing at random
100                          Guessing at random
010                          Guessing at random
001                          Uses the second erroneous algorithm
000                          Uses the third erroneous algorithm






References

Bergan, J. R., Cancelli, A. A., & Luiten, J. W. Mastery assessment with
    latent class and quasi-independence models representing homogeneous
    item domains. Journal of Educational Statistics, 1980, 5, 65-81.

Birenbaum, M., & Tatsuoka, K. On the dimensionality of achievement test
    data. Journal of Educational Measurement, 1982, 19, 259-266.

Birenbaum, M., & Tatsuoka, K. The effect of a scoring system based on
    the algorithm underlying the student's response patterns on the
    dimensionality of achievement test data of the problem solving
    type. Journal of Educational Measurement, 1983, 20, 17-26.

Coombs, C. H., Milholland, J. E., & Womer, F. B. The assessment of
    partial knowledge. Educational and Psychological Measurement, 1956,
    16, 13-37.

Copas, J. B. On symmetric compound decision rules for dichotomies.
    Annals of Statistics, 1974, 2, 199-204.

Dayton, C. M., & Macready, G. B. A scaling model with response errors
    and intrinsically unscalable respondents. Psychometrika, 1980, 45,
    343-356.

Goodman, L. A. On the estimation of parameters in latent structure
    analysis. Psychometrika, 1979, 44, 123-128.

Goodman, L. A. Exploratory latent structure analysis using both
    identifiable and unidentifiable models. Biometrika, 1974, 61,
    215-231.

Kale, B. K. On the solution of likelihood equations by iteration
    processes. The multiparametric case. Biometrika, 1962, 49,
    479-486. (a)

Kale, B. K. A note on a problem in estimation. Biometrika, 1962, 49,
    553-557. (b)

Katti, S. K. Exact distribution for the chi-square test in the one way
    table. Communications in Statistics, 1973, 2, 435-447.

Macready, G. B., & Dayton, C. M. The use of probabilistic models in the
    assessment of mastery. Journal of Educational Statistics, 1977, 2,
    99-120.

Smith, P. J., Rae, D. S., Manderscheid, R. W., & Silbergeld, S. Exact
    and approximate distributions of the chi-square statistic for
    equiprobability. Communications in Statistics -- Simulation and
    Computation, 1979, B8, 131-149.

Wilcox, R. R. Estimating the parameters of the beta-binomial
    distribution. Educational and Psychological Measurement, 1979, 39,
    527-535.

Wilcox, R. R. Some new results on an answer-until-correct scoring
    procedure. Journal of Educational Measurement, 1982, 19, 67-74. (a)

Wilcox, R. R. A comment on approximating the chi-square distribution in
    the equiprobable case. Communications in Statistics -- Simulation
    and Computation, 1982, 11, 619-623. (b)

Wilcox, R., Cliff, N., & Embretson, S. Measuring mental abilities:
    Advances in statistical theories. Beverly Hills: Sage Publishing
    Co., to appear.

Zehna, P. W. Invariance of maximum likelihood estimation. Annals of
    Mathematical Statistics, 1966, 37, 744.



MEASURING MENTAL ABILITIES WITH
LATENT STATE MODELS



Rand R. Wilcox
Center for the Study of Evaluation
University of California, Los Angeles



ABSTRACT



The three goals in this paper are (1) to review the latent state
models that have been proposed for measuring aptitude and achievement,
(2) to outline the measurement problems that can now be solved with
latent state models, and (3) to discuss how latent state and latent
trait models are related. It is pointed out that latent state and
latent trait models measure different things that are related to one
another in a complicated fashion.



1. INTRODUCTION 

There are now four interrelated approaches to measuring aptitude
and achievement that are based on different notions of true scores.
Classical test theory is the best known approach, where ability is defined
in terms of a propensity distribution. The other three are latent trait
models, item sampling models, and latent state models. No doubt latent
state models are the least well known. One reason for this is that early
models made very restrictive or inconvenient assumptions, and even if
the models could be applied, it was unclear how to solve the many
measurement problems that arise in practice (cf. Meskauskas, 1976).

Today the situation has changed radically: there are now latent state
models that are relatively easy to use, and empirical investigations
indicate that the underlying assumptions are usually met, or that they are
reasonable approximations of reality. Just as important is that many
measurement problems can be solved that were previously impossible to
address. The three major goals in this paper are to (1) review the various
latent state models, (2) describe some of the measurement problems that can
now be solved with latent state models, and (3) briefly indicate how latent
trait models, item sampling models, and latent class models are related
to one another. The last goal is particularly important because when there
are errors at the item level such as guessing, all three models estimate
different quantities that are related to one another in a complicated
fashion. In fact, if a measurement problem is formulated in terms of one
model, it may be very difficult to find a satisfactory reformulation of
the problem in terms of another model. This point is elaborated below.
Accordingly, it is important to consider the differences among the models
when addressing a particular measurement problem.

It should be stressed that none of the models described below is
considered to be always bad or inappropriate. The position advocated here
is that an eclectic approach to measuring mental abilities should be used.
That is, the choice of a true score model should be dictated, at least in
part, by the goal of the test, or the type of ability being estimated.
All that is being suggested is that different models are based on
different constructs, and so they estimate different things, which suggests
that some models may be inappropriate in some situations, or that several
models might be used to study a test. For example, the type of guessing
examined in latent state models is completely ignored in all other models,
and so if this type of guessing is deemed important, a latent state model
should be used. There is a widespread belief that the guessing parameter
in latent trait models is the same as the notion of guessing in latent
state models, but this is not true. In section 6 an attempt is made at
explaining the difference.

The paper is organized as follows: Section 2 briefly reviews the
basic elements of latent trait models that will be needed in the paper.
Section 3 does the same for item sampling models, and some comments are
made about how these models relate to latent trait models. Section 4
reviews the theoretical developments in latent class models that are
specifically intended for measuring aptitude and achievement. Certain
aspects of these models were reviewed by Macready and Dayton (1980b),
and so these features will not be discussed here. Section 5 describes
applications that cannot be addressed by other measurement models.
Included are generalizations of item sampling models. Section 6 makes
additional comments on how latent trait and latent class models are
related to one another. In particular, this section discusses the
importance of guessing in latent trait models, and it points out that the
type of guessing examined in latent class models is completely ignored
in latent trait models, even in Birnbaum's three parameter model.



2. Latent Trait Models 

Latent trait models are discussed in detail by Birnbaum (1968),
Lord and Novick (1968, ch. 16), and Lord (1980), and Hambleton et al. (1978)
give an excellent review of this approach to mental test theory. See
also the 1977 special issue of the Journal of Educational Measurement,
Weiss and Davison (1981), and the 1982 special issue of Applied
Psychological Measurement.

Generally, these models express the probability of an examinee giving
a correct response to an item as a function of an examinee's "ability"
and certain item parameters. For example, the Rasch model postulates
that $p(\theta)$, the probability a specific examinee with ability level
$\theta$ ($-\infty < \theta < \infty$) will produce the correct response to
a dichotomously scored item, is

    $p(\theta) = \exp(\theta - b) / (1 + \exp(\theta - b))$                (2.1)

where $b$ (the difficulty level) is a parameter that characterizes the item.
(See, for example, Wright, 1977; Wainer et al., 1980.)

An alternative expression for $p(\theta)$ is the two parameter normal
ogive model given by

    $p(\theta) = \int_{-\infty}^{L} \phi(t)\,dt$                            (2.2)

where $\phi(t)$ is the standard normal density function, $L = a(\theta - b)$,
and $a$ is the item "level of discrimination". A closely related model is
the two parameter logistic model where

    $p(\theta) = (1 + \exp(-1.7a(\theta - b)))^{-1}$                       (2.3)

(Birnbaum, 1968, p. 400). An even more general three parameter model
is given by

    $p(\theta) = c + (1 - c)\,\dfrac{\exp(1.7a(\theta - b))}{1 + \exp(1.7a(\theta - b))}$    (2.4)

where $c$ is the probability of a correct response from an examinee with
low ability. In all of the above models, the symbols $a$, $b$, and $c$
represent unknown parameters that are estimated with the observed scores
of a sample of examinees. A particularly important feature of latent trait
models is that once the item parameters are estimated, it is possible
to construct a test so that the expected observed scores will have certain
properties that are deemed important.
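
As an illustration only, the four item response functions above can be
written as short Python functions. This is a minimal sketch, not part of
the original paper: the function names are hypothetical, and the three
parameter function assumes the usual Birnbaum form, a lower asymptote c
plus (1 - c) times the two parameter logistic curve.

```python
import math


def rasch(theta, b):
    """Rasch model (2.1): one item parameter, the difficulty b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))


def normal_ogive(theta, a, b):
    """Two parameter normal ogive (2.2): standard normal CDF at L = a(theta - b)."""
    L = a * (theta - b)
    return 0.5 * (1.0 + math.erf(L / math.sqrt(2.0)))


def logistic_2pl(theta, a, b):
    """Two parameter logistic model (2.3), with the 1.7 scaling constant."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))


def logistic_3pl(theta, a, b, c):
    """Three parameter logistic model (2.4), assumed Birnbaum form."""
    return c + (1.0 - c) * logistic_2pl(theta, a, b)
```

With the constant 1.7 in place, the logistic and normal ogive curves stay
close to one another over the whole ability range, which is why the two
models are described as closely related.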

Numerous articles on latent trait models have been published.
However, as previously indicated, the goal of this paper is not to summarize
these results. For present purposes, the important point is the
interpretation of $p(\theta)$. One interpretation is that $p$ is the
probability of a correct response over repeated independent administrations
of the item. In other words, $p$ is the examinee's expected observed score,
where the expectation is defined in terms of a propensity distribution.
However, Lord (1980, ch. 15; 1974) argues that this interpretation leads to
certain logical problems, and so he proposes that one of two other
interpretations be used instead. The first imagines a pool of items all of
which have the same item parameters $a$, $b$, and $c$. Then $p(\theta)$ is
the probability that a specific examinee with ability $\theta$ will give the
correct response to an item randomly sampled from this item domain. The
actual items on a test will typically have different item parameters, and so
each of these items would be viewed as being sampled from an item domain
corresponding to the values of $a$, $b$, and $c$.

The second interpretation views examinees, rather than items, as
being randomly sampled. For an item with parameters $a$, $b$, and $c$,
$p(\theta)$ is the probability of a correct response from a randomly sampled
examinee who has ability level $\theta$.



Some other basic assumptions associated with latent trait models
should be mentioned. One of these is the assumption of local independence.
This means that given $\theta$, responses are independent of one another.
Letting $p_i$ be the value of $p(\theta)$ for the ith item on a test, local
independence means that if items are scored dichotomously, the probability
of y items correct given $\theta$ is

    $f(y \mid \theta) = \sum \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}$      (2.5)

where $x_i$ = 1 or 0 according to whether the ith item is answered correctly,
and where the summation is over all vectors $(x_1, \ldots, x_n)$ such that
$\sum x_i = y$. A test of this assumption was recently proposed by Holland
(1981), but it has not yet been applied to real data.

Another property of the most commonly used latent trait models is
that they are unidimensional. This means that only one person parameter,
namely $\theta$, is needed to determine the probability of a correct response
to an item. McDonald (1981) points out that latent trait models can be
viewed as a nonlinear factor analysis model with only one factor (cf.
Mellenbergh, 1981).

Another observation will be useful later. This is that if all the
items on an n-item test have the same item parameters (i.e., the same
values for $a$, $b$, and $c$) then (2.5) reduces to

    $f(y \mid \theta) = \binom{n}{y} p^{y} (1 - p)^{n - y}$                (2.6)

the binomial probability function, where $p$ is the common value of the
$p_i$'s.
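
The reduction from (2.5) to (2.6) can be checked numerically. The sketch
below is illustrative only (the function names are hypothetical): it
evaluates (2.5) by brute force, summing over every response vector with
exactly y correct answers, and compares the result with the binomial
probability when all the item probabilities are equal.

```python
from itertools import product
from math import comb


def prob_y_correct(p, y):
    """Equation (2.5): sum over all 0/1 vectors x with sum(x) == y.

    p is the list of item probabilities p_i for one fixed ability level.
    Enumeration is exponential in len(p), so this is for small n only.
    """
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        if sum(x) != y:
            continue
        term = 1.0
        for p_i, x_i in zip(p, x):
            term *= p_i if x_i == 1 else 1.0 - p_i
        total += term
    return total


def binomial_pmf(n, y, p):
    """Equation (2.6): binomial probability of y correct out of n."""
    return comb(n, y) * p ** y * (1.0 - p) ** (n - y)
```

For example, `prob_y_correct([0.3] * 4, 2)` and `binomial_pmf(4, 2, 0.3)`
agree up to rounding error, while unequal item probabilities generally give
a distribution that is not binomial.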

Finally, for multiple choice items, latent trait models do not deal
with the construct "knowing" in any way; they deal with the probability
of a correct response, which is different from the probability of knowing.



3. Item Sampling Models

A third class of true score models is known as item sampling models.
The binomial error model is the one most frequently used; a recent review
is given by Wilcox (1981a), and so only its basic properties will be
given here.

Consider a single examinee responding to an n-item test. One situation
leading to the binomial error model is where the n items are actually
sampled from some larger item domain. If $\zeta$ is the proportion of items
the examinee would get correct if every item in the item pool were attempted,
then the probability of y correct responses is

    $f(y \mid \zeta) = \binom{n}{y} \zeta^{y} (1 - \zeta)^{n - y}$          (3.1)

(It is assumed that sampling is from an infinite pool, or finite pool
with replacement, and that $\zeta$ remains constant over the trials.) In
many situations items are not randomly sampled, and there is no item
pool. Thus, there is no a priori reason for assuming (3.1) holds. It
might seem, therefore, that the binomial error model is not really justified,
but the point is that (3.1) might give a good fit to data. Indeed, the
empirical investigations cited by Wilcox (1981a) suggest that (3.1) will
frequently give good results when addressing various measurement problems.
Note that there is also no a priori reason for using latent trait models
(Lord, 1980). Again the crucial question is whether the models give good
results with real data.

It might appear that the binomial error model is more restrictive
than latent trait models in the sense that if the item parameters $a$, $b$,
and $c$ are the same for every item, the probability of y correct responses
is given by (2.6), which is the same form as (3.1). In particular, one might
conclude that $\zeta$ in (3.1) and $p$ in (2.6) are the same. They are
related, but in a more complicated fashion.

Typically, the n items on a test will have different values for $a$,
$b$, and $c$. If items are really sampled from some item domain, the
corresponding item parameters will have some distribution, say $g(a,b,c)$.
Thus, for a randomly sampled item, the probability of a correct response
from a specific examinee with ability level $\theta$ is $\zeta = E(p(\theta))$,
where the expectation is taken with respect to the random variables $a$, $b$,
and $c$. That is, $\zeta = \iiint p(\theta)\, g(a,b,c)\, da\, db\, dc$.

To illustrate the practical implications of this result, consider a
criterion-referenced test where the goal is to determine whether an
examinee's percent correct true score $\zeta$ is above or below the known
constant $\zeta_0$. Is it possible to formulate the problem in terms of a
latent trait model? In particular, how can a criterion score be found
(a value of $\theta_0$) that corresponds to $\zeta_0$? If the suggestion in
Lord (1980, p. 174) is followed, one might determine the criterion score to
be the value of $\theta$ such that

    $\sum_{i=1}^{n} p_i(\theta) = n\zeta_0$                                (3.2)

where $p_i(\theta) = p(\theta, a_i, b_i, c_i)$ is the item response function
for the ith item on an n-item test. The point is that if a different set of
items were used with presumably different item parameters, equation (3.2)
would yield a different criterion score. Thus, this procedure yields, at
best, an estimate of what the criterion score would be if the problem were to
be reformulated in terms of $\theta_0$.
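
The dependence of the criterion score on the particular items can be made
concrete with a small numerical sketch. It is offered under stated
assumptions, not as the procedure in Lord (1980): the criterion equation is
taken to be "sum of the item response functions equals n times the cutoff",
the items follow the three parameter logistic form with positive
discrimination, and the function names and search interval are hypothetical.

```python
import math


def p3(theta, a, b, c):
    """Three parameter logistic item response function (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def criterion_theta(items, zeta0, lo=-10.0, hi=10.0, tol=1e-10):
    """Solve sum_i p_i(theta) = n * zeta0 for theta by bisection.

    items is a list of (a, b, c) tuples with a > 0, so the left-hand
    side is increasing in theta and bisection applies.
    """
    target = len(items) * zeta0

    def excess(theta):
        return sum(p3(theta, a, b, c) for a, b, c in items) - target

    if excess(lo) > 0.0 or excess(hi) < 0.0:
        raise ValueError("zeta0 is not attainable on the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two hypothetical two-item tests; the second is uniformly harder.
items_a = [(1.0, 0.0, 0.2), (1.0, 0.5, 0.2)]
items_b = [(1.0, 1.0, 0.2), (1.0, 1.5, 0.2)]
theta_a = criterion_theta(items_a, 0.8)
theta_b = criterion_theta(items_b, 0.8)
```

Here the same cutoff of 0.8 maps to two different ability values, because
the second item set has every difficulty shifted upward by one unit.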

Observe that $n\zeta_0$ is different from the true score used by Lord (1980,
p. 174). Lord is referring to an expected number-correct true score, but
the expectation is different from $n\zeta_0$ in equation (3.1), as was
explained above.

Does this mean that one model is better than another? The answer
is an unequivocal no; the point is that they are not exactly the same, and
the choice of a model should depend on what an investigator wants to know.
Of course, some individuals might be dissatisfied with both models. In
terms of a criterion-referenced test, at least three alternative approaches
are possible. The first is to simply specify a passing score on a test
without any reference to some notion of true score. (See Huynh, 1976;
Subkoviak, 1976; Wilcox, 1979a.) The second is to take the view that
examinees either know or do not know the answer to an item on a test, and
the goal is to determine which of the n items an examinee really knows.
The third view is that the items represent a larger domain of items, and
the goal is to determine the proportion of items in the item pool that
the examinee knows. The latter two views are discussed below.

4. Latent State Models

Latent state models (also known as latent structure or latent class
models) have existed for some time (e.g., Lazarsfeld & Henry, 1968;
Lazarsfeld, 1950). One of the original applications was measuring attitudes
(Stouffer, 1950), but only situations involving aptitude and achievement
are considered here. Also, there are continuous latent structure models
that are similar to latent trait models, but only discrete models are
discussed.

A basic premise in latent state models is that in terms of a
specific item, examinees can be described as belonging to one of
finitely many states. The relative merits of this view are discussed
in a more general context by Hilke et al. (1977), Scandura (1971, 1973),
and Spada (1977).

The simplest case is where examinees are said to either know or not
know the correct response. The obvious problem is that under conventional
situations, an examinee's response might not reflect his/her true state.
For example, a testee might choose the correct response on a multiple-
choice test item without knowing what the correct response actually is.
Latent state models make assumptions about the way examinees behave when
responding to an item, or they make assumptions about the way items are
related to one another (for example, it might be assumed they are
hierarchically related), or they assume that examinees respond to the same
items on two different occasions in time. Although very general models are
available, no one model will be appropriate for every item on every test.
An investigator must make a decision about which latent state model is
most appropriate and most convenient in a given situation. Once test
scores are available, the chosen model can be checked in various ways.
For multiple-choice items, it now appears that one of two models will
frequently fit most or all of the items on a test (Wilcox, 1982b). If
future investigations support this result, it may now be possible to
apply latent state models in a relatively straightforward manner.

The purpose of this section is to review general theoretical results
on latent state models that are based on one of the three assumptions
mentioned above.



Test-Retest Models 

As a simple illustration of how latent state models work, suppose
an item is administered to a random sample of N examinees on two separate
occasions in time. Let $\zeta$ be the proportion of examinees in the
population of examinees who know the answer, and let $\beta$ be the
probability of correctly guessing the answer when the examinee does not
know. In other words, for a randomly sampled examinee

    $\beta$ = Pr(correct response | examinee does not know).

Let a 1 indicate a correct response, and a 0 an incorrect response to
an item. If $p_{ij}$ is the probability of the response pattern ij on
the two occasions (i = 0, 1; j = 0, 1), if no learning takes place between
the two administrations, and if the event of correctly guessing is
independent on the two testings, the probability of a correct-correct
response pattern for a randomly sampled examinee is

    $p_{11} = \zeta + (1 - \zeta)\beta^{2}$                                 (4.1)

For the remaining three response patterns, it follows that

    $p_{10} = p_{01} = (1 - \zeta)(1 - \beta)\beta$                         (4.2)

and

    $p_{00} = (1 - \zeta)(1 - \beta)^{2}$                                   (4.3)

The $p_{ij}$'s are not known, but they can be estimated with $x_{ij}/N$,
where $x_{ij}$ is the number of examinees who get the response pattern ij.
It follows that

    $\beta = 1 - \dfrac{p_{00}}{p_{10} + p_{00}}$                           (4.4)

Thus, the unknown latent quantity $\beta$ can be estimated by replacing the
$p_{ij}$'s with $x_{ij}/N$. Note that the model implies that
$p_{10} = p_{01}$, which can be tested (McNemar, 1947). Results on the power
of McNemar's test are given by Wilcox (1977a). Also, note that with a large
enough sample the model will probably be rejected, but it may be that
$p_{10}$ and $p_{01}$ are nearly the same.



If $\hat{\beta}$ is the estimate of $\beta$ using equation (4.4), $\zeta$ can
be estimated by replacing $\beta$ with $\hat{\beta}$ in equation (4.1),
replacing $p_{11}$ with $x_{11}/N$, and solving for $\zeta$. Some properties
of this estimation procedure are given by Wilcox (1977a). For example, it is
shown that if $p$ is the common value of $p_{10}$ and $p_{01}$ under the
assumption the model holds, $(x_{10} + x_{01})/N$
is an unbiased, efficient, maximum likelihood estimate of $p$.

A related and slightly more general model was proposed by Brownless
and Keats (1958). In addition to the latent parameters $\zeta$ and $\beta$,
the model includes the proportion of examinees who learn the item between
the two administrations, and the proportion of examinees who repeat the
same response from memory on the second testing. Not all of the parameters
in the Brownless and Keats model can be estimated, but $\zeta$ and $\beta$
can again be determined. For a similar model, see Marks and Noll (1967).

The Brownless and Keats model appears to be one of the earliest
attempts to go beyond the simple knowledge or random guessing model that
is frequently adopted. Unfortunately, for practical purposes, the models
just described are not convenient because they require two administrations
of an item.

Models Based on Items That Are Assumed To Be Related
in Some Particular Fashion

This section reviews models where items are assumed to be related
in a particular fashion. Two situations have been examined in the
literature. The first is based on the assumption that two or more items
are hierarchically related, and the second is that items are equivalent.
Two items are defined to be equivalent if all examinees know both items
or neither one. Of course, models for hierarchically related items
contain models for equivalent items as a special case. Consider two
equivalent items and let $\zeta$ be the proportion of examinees who know both.
Let $p_{ij}$ be the probability of the response pattern ij on two equivalent
items. If $\beta_1$ is the probability of correctly guessing the response to
the first item when the randomly sampled examinee does not know, if
$\beta_2$ is the corresponding probability on the second item, and if local
independence holds (i.e., given an examinee's latent state, the responses
are independent), then

    $p_{11} = \zeta + (1 - \zeta)\beta_1\beta_2$

    $p_{10} = (1 - \zeta)\beta_1(1 - \beta_2)$

    $p_{01} = (1 - \zeta)(1 - \beta_1)\beta_2$

    $p_{00} = (1 - \zeta)(1 - \beta_1)(1 - \beta_2)$

Solving for $\zeta$, $\beta_1$, and $\beta_2$ yields

    $\beta_1 = \dfrac{p_{10}}{p_{10} + p_{00}}$

    $\beta_2 = \dfrac{p_{01}}{p_{01} + p_{00}}$

and

    $\zeta = 1 - (p_{01} + p_{00})(p_{10} + p_{00}) / p_{00}$

Again, the $p_{ij}$'s can be estimated in the usual manner, which yields an
estimate of $\zeta$, $\beta_1$, and $\beta_2$ (Wilcox, 1977b).
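
As a sketch of how these closed form solutions might be applied to observed
counts, assuming the pattern probabilities are the ones implied by local
independence for two equivalent items with guessing rates $\beta_1$ and
$\beta_2$ and no errors by examinees who know, consider the following. The
function name and the counts are hypothetical.

```python
def equivalent_items_guessing(x11, x10, x01, x00):
    """Estimate (zeta, beta1, beta2) for two equivalent items.

    x_ij counts examinees with response i on the first item and j on the
    second (1 = correct). Assumes examinees who know always answer
    correctly, so only guessing errors are present.
    """
    n = x11 + x10 + x01 + x00
    if x00 == 0:
        raise ValueError("the 00 cell must be nonempty")
    p10, p01, p00 = x10 / n, x01 / n, x00 / n
    beta1 = p10 / (p10 + p00)
    beta2 = p01 / (p01 + p00)
    zeta = 1.0 - (p01 + p00) * (p10 + p00) / p00
    return zeta, beta1, beta2


# Hypothetical counts equal to the expected frequencies for
# zeta = 0.5, beta1 = 0.2, beta2 = 0.25, N = 2000.
zeta_hat, beta1_hat, beta2_hat = equivalent_items_guessing(1050, 150, 200, 600)
```

Three free cell probabilities are used to recover three latent parameters,
so the fit is exact and leaves nothing over for a goodness-of-fit check.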

Multiple-choice items are the most obvious examples where errors
at the item level (guessing) need to be considered. However, even when
completion items are used, it may be necessary to measure and correct
errors at the item level (e.g., Harris & Pearlman, 1978; Macready &
Dayton, 1977). This time, though, the quantity of interest is

    $\alpha$ = Pr(incorrect response | examinee knows),

and in the simplest case it is assumed that $\beta = 0$. Again $\zeta$ and
$\alpha$ can be related to the $p_{ij}$'s. In particular,

    $p_{11} = \zeta(1 - \alpha_1)(1 - \alpha_2)$

    $p_{10} = \zeta(1 - \alpha_1)\alpha_2$

    $p_{01} = \zeta\alpha_1(1 - \alpha_2)$

    $p_{00} = \zeta\alpha_1\alpha_2 + (1 - \zeta)$

Thus,

    $\alpha_1 = p_{01} / (p_{01} + p_{11})$

    $\alpha_2 = 1 - p_{11} / (p_{10} + p_{11})$

and

    $\zeta = (p_{01} + p_{11})(p_{10} + p_{11}) / p_{11}$

Replacing the $p_{ij}$'s with their usual unbiased estimate yields an
estimate of $\zeta$, $\alpha_1$, and $\alpha_2$. For some related models and
results see Knapp (1977), Harris and Pearlman (1978), and Harris et al.
(1980).
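
The same kind of calculation applies when guessing is ruled out and only
errors by examinees who know are allowed. The sketch below assumes the
pattern probabilities implied by $\beta = 0$ and item-specific error rates
$\alpha_1$ and $\alpha_2$; the function name and counts are hypothetical.

```python
def equivalent_items_slips(x11, x10, x01, x00):
    """Estimate (zeta, alpha1, alpha2) for two equivalent items, beta = 0.

    alpha_i is the probability that an examinee who knows the items
    nevertheless answers item i incorrectly.
    """
    n = x11 + x10 + x01 + x00
    if x11 == 0:
        raise ValueError("the 11 cell must be nonempty")
    p11, p10, p01 = x11 / n, x10 / n, x01 / n
    alpha1 = p01 / (p01 + p11)
    alpha2 = 1.0 - p11 / (p10 + p11)
    zeta = (p01 + p11) * (p10 + p11) / p11
    return zeta, alpha1, alpha2


# Hypothetical counts equal to the expected frequencies for
# zeta = 0.6, alpha1 = 0.1, alpha2 = 0.2, N = 1000.
zeta_hat, alpha1_hat, alpha2_hat = equivalent_items_slips(432, 108, 48, 412)
```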

If three or more equivalent items are available, it is possible to
estimate both $\beta$ and $\alpha$ using the procedure outlined by Goodman
(1979), or using the scoring method as in Macready and Dayton (1977). These
two estimation procedures rely on iterative techniques that approximate
the maximum likelihood estimates of the parameters in the model.
In practice, these techniques seem to converge very rapidly, and so
sometimes they could even be applied when computer facilities are not
available (cf. Kale, 1962). However, models can become quite complex,
necessitating computer facilities.

How can the assumption that two or more items are equivalent be
empirically checked? One way is to apply a goodness-of-fit test to the
resulting latent structure model, as is illustrated by Macready and Dayton
(1977). (For some recent results and comments on using goodness-of-fit
tests, see Smith et al., 1981; Koehler & Larntz, 1978; Chapman, 1976.)
However, this approach is useless in the case of only two items (unless
it is assumed that $\beta_1 = \beta_2$ and $\alpha = 0$) because there are
then three latent parameters and only four possible response patterns,
resulting in zero degrees of freedom.



An alternative approach was suggested by Hartke (1978) that is
based on latent partition analysis, and an index proposed by Baker and
Hubert (1977) might be useful in this endeavor as well. If multiple-
choice test items are being used, and if the test is administered
according to an answer-until-correct scoring procedure (which is described
below), certain equalities are implied when items are equivalent, and
these equalities can be tested (Wilcox, 1981d). Some additional possibilities
are mentioned by Wilcox (1982f).

Hierarchically Related Items or Guttman Scales

Latent structure models based on the assumption that items are
hierarchically related, or that the possible latent states form a Guttman
scale, include as a special case the notion of equivalent items. In
terms of equivalent items, examinees are described as being in one of
two states: they know both items or neither one. For two hierarchically
related items, a third state is included, namely knowing the second
item but not the first. Again, in certain special cases, the proportion
of examinees in each of the latent classes can be estimated using simple
(closed form) equations. Very general models are also available where
estimates are obtained via iterative techniques (Dayton & Macready, 1976,
1980).

As a simple illustration, consider two items and let $\zeta_1$ be the
proportion of examinees who know the second but not the first.
If the guessing rate is the same on the two items, i.e., $\beta_1 = \beta_2
= \beta$, say, then

    $p_{11} = \zeta + \zeta_1\beta + (1 - \zeta - \zeta_1)\beta^{2}$

    $p_{10} = \zeta_1(1 - \beta) + (1 - \zeta - \zeta_1)\beta(1 - \beta)$

    $p_{01} = (1 - \zeta - \zeta_1)(1 - \beta)\beta$

    $p_{00} = (1 - \zeta - \zeta_1)(1 - \beta)^{2}$

It follows that

    $\beta = p_{01} / (p_{01} + p_{00})$

    $\zeta_1 = (p_{10} - p_{01}) / (1 - \beta)$

    $\zeta = 1 - \beta^{-1}(1 - \beta)^{-1} p_{01} - \zeta_1$

and so maximum likelihood estimates are easily obtained (Wilcox, 1980a).
This model is restrictive in the sense that $\beta_1 = \beta_2$ might be
untenable, but much more general models are available which allow
$\beta_1 \neq \beta_2$ (Dayton & Macready, 1976, 1980).
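
A minimal sketch of these closed form estimates follows. It reads the
equations as printed, under which pattern 10 is the one produced when an
examinee in the partially informed group answers the item he or she knows
and fails to guess the other; that reading, the function name, and the
counts are assumptions made for illustration.

```python
def hierarchical_items(x11, x10, x01, x00):
    """Estimate (zeta, zeta1, beta) for two hierarchically related items.

    zeta  : proportion who know both items
    zeta1 : proportion who know exactly one (the easier) item
    beta  : common guessing rate on an unknown item
    Pattern 10 is taken to be 'known item right, other item wrong'.
    """
    n = x11 + x10 + x01 + x00
    if x01 == 0 or x00 == 0:
        raise ValueError("the 01 and 00 cells must be nonempty")
    p10, p01, p00 = x10 / n, x01 / n, x00 / n
    beta = p01 / (p01 + p00)
    zeta1 = (p10 - p01) / (1.0 - beta)
    zeta = 1.0 - p01 / (beta * (1.0 - beta)) - zeta1
    return zeta, zeta1, beta


# Hypothetical counts equal to the expected frequencies for
# zeta = 0.3, zeta1 = 0.2, beta = 0.25, N = 1600.
zeta_hat, zeta1_hat, beta_hat = hierarchical_items(610, 390, 150, 450)
```

Note that the estimate of $\zeta_1$ is driven entirely by the asymmetry
between the 10 and 01 cells; when the two cells are equal, the model
collapses to the equivalent items case.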

Verifying Hierarchies 

Interest in learning hierarchies has been with us for some time
(e.g., Gagne & Paradise, 1961; Gagne, 1968; Cox & Graham, 1968), but
here attention is focused only on the role latent structure models play
in verifying hierarchies. Apparently the first method of examining
whether two items are hierarchically related was proposed by White
and Clark (1973). The procedure is based on the assumption that for
each of the two items being investigated, an equivalent item is available.
The probability of the various response patterns can be written in terms
of the relevant latent parameters, which yields a test of whether the
items are hierarchically related. Although White and Clark (1973) were
explicitly interested in determining whether two items are hierarchically
related, technically they were not the first to formulate a model that
could be used for this purpose. In particular, Proctor (1970) proposed
a latent structure model where the latent states of examinees are assumed
to form a Guttman scale. A goodness-of-fit test could be used to check
whether it is reasonable to assume items are hierarchically related.
Today Proctor's model would presumably be replaced by ones proposed by
Dayton and Macready (1976, 1980), and again a goodness-of-fit test could
be used. However, as was the case for equivalent items, there are
situations where this is inappropriate. Again the problem is that there
are as many latent parameters as there are degrees of freedom.

A third method is based on an answer-until-correct scoring procedure.
If two items are hierarchically related, certain equalities should hold
which, for convenience, are described in a later section of the paper.

Some Concluding Remarks on Latent State Models for
Equivalent and Hierarchically Related Items

Clearly there are situations where the notion of equivalent or
hierarchically related items is too restrictive. This point was raised
by Molenaar (1981), and the author would certainly concur. However, there
are situations involving real data where the notion of equivalent items
seems to be useful (Macready & Dayton, 1977; Harris & Pearlman, 1978).
More recently, Harris et al. (1980) applied an equivalent item model
to real data collected in school settings. This was done every week
over a period of many weeks. All indications were that the test results
provided valuable and valid information. Moreover, these models allow
$\alpha > 0$, while the models described in the next section assume
$\alpha = 0$.



Methods of estimating the parameters in latent structure models were
already mentioned, and typically these are used. For some related
results see Harris et al. (1980), Rao (1973), Wilcox (1977a, 1977b,
1980a, 1980b), Haberman (1977), Herts et al. (1973), and van der Linden
(1981).

For some related general results and comments on latent structure
models, see McHugh (1956), Keesling (1974), Bergan et al. (1980),
Reulecke (1977), Lazarsfeld and Henry (1968), Gibson (1959, 1962),
Goodman (1974), Green (1951), and Gilula (1979). For additional comments
on how latent structure models relate to latent trait models, see van der
Linden (1978). For an approach to measurement problems that is somewhat
related to the discussion in this subsection, see Cliff (1977) and
Harnisch and Linn (1981).

Models Based on Assumptions About How Examinees Behave
When Taking Multiple-Choice Test Items

Despite the very general nature of the model discussed by Dayton
and Macready (1980), and some recent related results reported by Macready
and Dayton (1980a) and Bergan et al. (1980), there remains the practical
problem of initially determining how items relate to one another so that
a particular latent structure model can be tried out on observed test
scores. Another potential problem is that the items on a particular n-item
test might not be consistent with any particular form of the model. For
practical purposes it would be convenient to have a model that could be
used to measure the effects of guessing without assuming items are
related in any particular fashion. It would also be helpful if the
model were easy to use, i.e., it could be used in a classroom with minimal
effort. A third desirable property, one related to the first, would
be the ability to easily fit a simple model to all the items on an
arbitrarily chosen n-item test. This last goal was reached in Wilcox
(1982b). Before indicating how this was done, some earlier results
will be given.

Suppose multiple-choice test items are scored according to an
answer-until-correct (AUC) scoring procedure. This means that examinees
choose an alternative, and they are told immediately whether they are
correct. One way to accomplish this is to have examinees erase a shield
on a specially designed answer sheet which is available commercially.
Underneath the shield is an indication of whether the examinee is correct.
If incorrect, the testee chooses another response, and this process
continues until the correct alternative is identified.

Unlike other latent structure models, Wilcox (1981c) makes certain
assumptions about how examinees behave when responding to a multiple-
choice item, namely, that examinees eliminate as many distractors as
they can (through partial information) and then guess at random from among
the alternatives that remain. This assumption is not new (e.g., Horst,
1933), but it was not previously used in conjunction with latent state
models. Undoubtedly this assumption is an oversimplification of reality,
but it has proven to be consistent with most of the items studied by Wilcox
(1982a, 1982b, in press a).

For a randomly sampled examinee responding to a particular item, let
$\zeta$ again be the probability the examinee knows the answer, and let $\zeta_i$
($i = 0, \ldots, t-2$) be the probability the examinee can eliminate i
distractors if he/she does not know, where t is the number of alternatives.
If $p_i$ is the probability that a randomly selected examinee gets the correct
response on the ith attempt of an item, then

$$p_1 = \zeta + \sum_{i=0}^{t-2} \frac{\zeta_i}{t - i} \qquad (4.6)$$

and

$$p_i = \sum_{j=0}^{t-i} \frac{\zeta_j}{t - j} \qquad (i = 2, \ldots, t). \qquad (4.7)$$

It follows that $\zeta = p_1 - p_2$. Thus, if in a sample of N examinees $x_i$
are correct on the ith attempt, then

$$(x_1 - x_2)/N \qquad (4.8)$$

is an estimate of $\zeta$. The model implies that

$$p_1 \ge p_2 \ge \cdots \ge p_t \qquad (4.9)$$

and this can be tested (Robertson, 1978). Empirical investigations
(Wilcox, 1982a, 1982b) suggest that (4.9) will frequently hold.
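The estimate in (4.8) and the ordering implied by the model can be illustrated with a short sketch. It assumes the data for one item are summarized as a list of counts, where entry i-1 is the number of examinees who were correct on attempt i; the function names are hypothetical, and the ordering check is only a descriptive comparison of sample counts, not the significance test of Robertson (1978).

```python
from typing import Sequence


def estimate_zeta(attempt_counts: Sequence[int]) -> float:
    """Estimate the proportion who know the item as (x_1 - x_2) / N.

    attempt_counts[i - 1] is x_i, the number of examinees whose correct
    response came on attempt i. Under answer-until-correct scoring every
    examinee eventually answers correctly, so the counts sum to N.
    """
    n_examinees = sum(attempt_counts)
    if n_examinees == 0 or len(attempt_counts) < 2:
        raise ValueError("need at least two attempts and one examinee")
    return (attempt_counts[0] - attempt_counts[1]) / n_examinees


def sample_counts_are_ordered(attempt_counts: Sequence[int]) -> bool:
    """Descriptive check that x_1 >= x_2 >= ... >= x_t in the sample."""
    return all(a >= b for a, b in zip(attempt_counts, attempt_counts[1:]))


if __name__ == "__main__":
    counts = [60, 20, 12, 8]  # hypothetical four-alternative item, N = 100
    print(estimate_zeta(counts))
    print(sample_counts_are_ordered(counts))
```

A sample that violates the ordering does not by itself show that the model fails, since sampling error can produce small reversals; that is what the formal test is for.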

Equation (4.9) rules out the misinformation model proposed in Wilcox
(1982b), but it is difficult to say whether testing (4.9) gives a strong
indication of whether the model holds. Perhaps some other model could
be derived that explains existing data (e.g., Hutchinson, 1982). In
addition, the random guessing component of the model is undoubtedly untrue
(i.e., examinees guessing at random once they eliminate as many distractors
as possible). However, an empirical investigation into an implication of
the random guessing component of the model suggested that the model gives
a tolerable approximation of reality (Wilcox, in press a). When this
investigation was conducted, it was thought that a generalization of the
AUC model would be needed that takes into account the order in which
distractors are chosen. So far, though, it seems that the simpler model
described above will suffice.



The latent structure model just described implies that equation (4.9)
must hold for the population of examinees. In a few instances this
assumption appears to be unreasonable, and the question arises as to how these
results might be explained. The solution proposed by Wilcox (1982b)
is that some of the examinees have misinformation relative to the question
being asked. This appeared to be a reasonable speculation based on the
way the questions were phrased, and so a modification of the
answer-until-correct scoring procedure was proposed. For example, one of these items
dealt with the weight of iron after being heated. The examinees (who were
approximately 14 years old) were told that when heated, iron expands. They
were also told the weight of the iron before it was heated. They were
then asked what the weight of the iron would be when red hot. Three of
the alternatives were weights that were higher than the weight at room
temperature. Thus, it seems reasonable that some examinees might believe
that iron is heavier because it expands, and they would therefore choose
among the three alternatives consistent with this belief.

In contrast to earlier models, it was decided to derive a latent
structure model where examinees belong to one of three latent states
rather than only two, namely, they know the answer, they have
misinformation as just described, or they are in complete ignorance and guess at
random. The resulting model gave a good fit to the data, and a similar
model was derived for the other item that did not fit the original
answer-until-correct model described above. The point that is particularly
interesting is that observed responses to all 30 items on the test could
be explained with models that are very easy to use.

Despite the advantages of this model, there may be situations where
certain features are objectionable. For example, the model assumes that
an item has at least one effective distractor for those examinees who do
not know. Put another way, it is assumed that no distinction is made
between examinees who know and those who can eliminate all of the distractors.
For practical purposes, the seriousness of this problem is not known.
Another feature is that it assumes $\alpha$ = Pr(incorrect response | examinee knows)
= 0. Again the seriousness of this restriction is not well understood.

Some Miscellaneous Models

In addition to the models described so far, three slightly related
models have been proposed by Reulecke (1977). The first, which Reulecke
calls the Poisson-binomial model, assumes that examinees are responding
to n equivalent items. For examinees who know, it is assumed that they
give an incorrect response to x items with probability $h^x \exp(-h)/x!$,
the Poisson density, where h is an unknown parameter. For examinees
who do not know, it is assumed that $\beta = .5$. His second model replaces
the assumption that $\beta = .5$ with the assumption that guessing x items
given the examinee does not know is u ~ exp(-x)/(n-x) where u is an unknown
parameter. The third model is the same as the last except that an
additional latent state is included, namely, that some examinees guess
at random.






An alternative approach to measuring misinformation was proposed by
Duncan (1974). For a particular n-item test, let $\delta_1$ be the number of
items an examinee knows, and let $\delta_2$ be the number of items for which the
examinee has misinformation. If every item has t alternatives, Duncan
assumes that guessing is at random, and that the probability of getting
x items correct is

$$\binom{n - \delta_1 - \delta_2}{x - \delta_1} \left(\frac{1}{t}\right)^{x - \delta_1} \left(\frac{t-1}{t}\right)^{n - \delta_2 - x}.$$



Both Bayesian and empirical Bayesian estimates of $\delta_1$ are discussed.
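A minimal sketch of the probability Duncan's assumptions lead to is given below. It assumes that known items are always answered correctly, that items with misinformation are always answered incorrectly, and that the remaining items are guessed at random with success probability 1/t; the function and argument names are hypothetical.

```python
from math import comb


def duncan_prob_correct(x: int, n: int, t: int, known: int, misinformed: int) -> float:
    """Probability of exactly x correct answers on an n-item test.

    known       -- number of items the examinee knows (always correct)
    misinformed -- number of items answered from misinformation
                   (assumed always incorrect)
    The other n - known - misinformed items are guessed at random among
    t alternatives.
    """
    guessed = n - known - misinformed
    if guessed < 0:
        raise ValueError("known + misinformed cannot exceed n")
    lucky = x - known  # correct answers that must come from guessing
    if lucky < 0 or lucky > guessed:
        return 0.0
    return comb(guessed, lucky) * (1 / t) ** lucky * ((t - 1) / t) ** (guessed - lucky)


if __name__ == "__main__":
    # Hypothetical examinee: 10 items, 4 alternatives, 5 known, 2 misinformed.
    for score in range(11):
        print(score, round(duncan_prob_correct(score, 10, 4, 5, 2), 4))
```

The number of wrong guesses, guessed - lucky, equals n - misinformed - x, which is the exponent on (t-1)/t in the expression above.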



5. Applications of Latent Class Models, and the Need To Correct 
For Guessing 

Latent class models can now be used to analyze items, analyze n-item
tests, and they can be used when an item sampling model is deemed
appropriate. This section outlines the procedures that are available. The
main advantages of these procedures are that they provide ways of dealing
with guessing that are not possible with other models. But why worry about
guessing? Perhaps guessing will have little effect on the purpose of a
test. Of course answering this question is crucial in order to motivate
the procedures described here, and so a few comments will be made along
these lines.

Let $\omega$ be the proportion of items in a domain of items that an
examinee knows, and suppose the goal is to determine whether $\omega > \omega_0$ for some
predetermined $\omega_0$. This problem has received considerable attention in
recent years, as evidenced by the 1980 special issue of Applied Psychological
Measurement. Suppose $\omega_0 = .8$, and that it is desired to choose n,
the test length, so that the probability of correctly determining whether
$\omega > \omega_0$ is at least .9 whenever $\omega \ge .9$ or $\omega \le .7$. From Wilcox (1980b), n = 29
items are required. Van den Brink and Koele (1980) pointed out that even
if random guessing can be assumed, about five or six times as many items
are needed to ensure the same level of accuracy as when there is no
guessing. Wilcox (1980b) noted that random guessing cannot be assumed, in which
case over 2,600 items are needed.

As another illustration, Ashler (1979) observed that guessing can
seriously affect the estimate of the biserial correlation.

A third reason to be concerned about guessing is that it might be
important to determine how many items on a test an examinee knows, or even
which items [...] important when measuring achievement, but guessing can
seriously affect the results. An illustration with real data is given in
Wilcox (1982d).

Finally, most solutions to measurement problems ignore guessing, or
assume guessing is at random. Perhaps one of these assumptions will give
reasonable results in some situations, but all indications are that this is
not always the case. In fact guessing seems to be more serious than might
at first be expected, and so it seems that there might be few measurement
problems where guessing can be ignored. It might appear that certain latent
trait models handle guessing, but this is not necessarily the case because
the type of guessing examined in latent class models is different from
the type of guessing in latent trait models. This point is elaborated in
Section 6.






The way latent class models are applied will depend in part on
whether an item sampling view is believed to be appropriate, whether
operational versions of the test are to be based on conventional scoring
or AUC scoring, or whether items can be assumed to be related in a
particular fashion. If conventional scoring is to be used, then
preliminary investigations of a test might be made via AUC scoring to
determine which items [...]
the overall accuracy of a specific n-item test. Methods for solving these
problems are outlined below.

' ! _ . ■ j » j . 

Analyzing an n-Item Test 
Consider an n-item test, and suppose the goal is to determine how
many items an examinee knows. Further suppose that it is decided an
examinee knows if and only if the correct response is given. How accurate
is the test for the typical (randomly sampled) examinee?

Let $\tau_i$ be the probability of making a correct decision
about whether an examinee knows or does not know the ith item when a
conventional scoring procedure is used. The parameter is easily estimated
under an answer-until-correct scoring procedure; it is one minus the
probability of a correct response on the second attempt (Wilcox, 1981c). A
natural way to characterize an n-item test is to use

$$\tau_s = \sum_{i=1}^{n} \tau_i,$$

the expected number of correct decisions for a randomly sampled examinee
who takes the test.

In some cases some additional related information is useful. Suppose,
for example, there are n = 10 items, and $\tau_s$ is estimated to be 7. That
is, the expected number of items for which a correct decision is made
about what an examinee knows is estimated to be 7. To get a better
indication of how well the test is performing it would be useful to also
know the likelihood of, say, at least 8 correct decisions among the n = 10
items. Knowing $\tau_s$ does not yield much information about this value.

More generally, let $p_k$ be the probability of making at least k
correct decisions among the n items about whether a typical examinee
knows. Certainly $p_k$ is a useful measure of how well a test indicates
what a typical examinee knows. If $\tau_s$ or $p_k$ is judged to be too small,
the test needs to be modified in some way. For example, the number of
distractors might be increased, or perhaps the existing distractors
might be improved.

The parameter $p_k$ can be expressed symbolically, and more precisely,
in the following manner. Suppose it is decided that a testee knows the
answer to an item if and only if the correct response is given on the
first attempt of the item. For a randomly sampled examinee, let $y_i = 1$
if a correct decision is made about the examinee's latent state on the
ith item; otherwise $y_i = 0$. Then

$$p_k = \Pr\Big(\sum_{i=1}^{n} y_i \ge k\Big).$$

Wilcox (1981b, 1982f) refers to $p_k$ as the "k out of n" reliability
of a test.



In classical test theory, the reliability of a test can be estimated
if two parallel forms exist. Of course, no two tests are ever exactly
parallel, and so bounds on the reliability are used instead. The best
known bounds are the Kuder-Richardson formulae. These bounds are expressed
in terms of unknown population parameters such as the difficulty level
of the items on the test, and the variance of the test scores. Although
these parameters are not known, they can be estimated. A similar situation
occurs in terms of estimating $p_k$. If it can be assumed that $y_i$ is
independent of $y_j$, $i \ne j$, $p_k$ could be estimated (Wilcox, 1982c).
However, there may be cases where this independence does not hold, in
which case there is no method of estimating $p_k$. Both upper and
lower bounds on $p_k$ are available, though, and these can be estimated (Wilcox,
1982f, 1981c). Even if $y_i$ and $y_j$ are independent, estimating $p_k$ can
be a computationally tedious process when n is large, and so again
these bounds might be useful.
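To make the independence case concrete, the following sketch computes the expected number of correct decisions and the k out of n reliability from a list of per-item probabilities of a correct decision. It assumes the $y_i$ are independent, so the number of correct decisions follows what is usually called a Poisson-binomial distribution; the function names are hypothetical, and this is not presented as the estimation procedure of Wilcox (1982c).

```python
from typing import List, Sequence


def correct_decision_distribution(tau: Sequence[float]) -> List[float]:
    """Distribution of the number of correct decisions, assuming independence.

    tau[i] is the probability of a correct decision on item i. Items are
    added one at a time, updating Pr(m correct decisions so far).
    """
    dist = [1.0]  # with no items, zero correct decisions has probability 1
    for tau_i in tau:
        updated = [0.0] * (len(dist) + 1)
        for m, prob in enumerate(dist):
            updated[m] += prob * (1.0 - tau_i)  # wrong decision on this item
            updated[m + 1] += prob * tau_i      # correct decision on this item
        dist = updated
    return dist


def k_out_of_n_reliability(tau: Sequence[float], k: int) -> float:
    """Pr(at least k correct decisions among the n items)."""
    return sum(correct_decision_distribution(tau)[k:])


def expected_correct_decisions(tau: Sequence[float]) -> float:
    """Expected number of correct decisions (the sum of the tau values)."""
    return sum(tau)


if __name__ == "__main__":
    tau = [0.7] * 10  # hypothetical 10-item test
    print(expected_correct_decisions(tau))
    print(k_out_of_n_reliability(tau, 8))
```

With ten items that each have probability .7 of a correct decision, the expected number of correct decisions is 7 while the probability of at least 8 correct decisions is only about .38, which illustrates why the expected value alone says little about the reliability. The update takes on the order of n squared operations, so the computation is not tedious on a computer; the practical obstacle remains the independence assumption.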

In the event $y_i$ and $y_j$ are independent for all i and j, it is also
possible to make inferences about whether $p_k$ is large or small (Wilcox,
1982c). Unfortunately, there is currently no empirical procedure
for determining when this independence might hold, and so some caution
should be exercised.

More recently, Wilcox (in press b) proposed an approximation of $p_k$
that appears to work well when n is small, say n < 5. For larger
values of n the Bonferroni inequality can be applied as indicated by
Wilcox.






What To Do When $\tau_s$ or $p_k$ is Too Small



If the estimate of $\tau_s$ or $p_k$ is judged to be too small, two general
approaches are available. First, identify which items are seriously
affected by guessing, and either increase the number of distractors, or
attempt to improve the ones that are being used. The second approach is
to use a scoring procedure based on an AUC test proposed by Wilcox
(1982e). However, the effectiveness of Wilcox's scoring procedure is not
known when the number of examinees is small. An investigation into this
problem is underway.

If the first approach is selected, two measures are available for
deciding whether distractors for an item are working well. The first is
to use some Schur function (see Marshall and Olkin, 1979), such as

$$H(p_2, \ldots, p_t) = -\sum_{i=2}^{t} \frac{p_i}{1 - p_1} \ln \frac{p_i}{1 - p_1},$$

the entropy function, which measures how "far away" guessing is from being
random. H is also known as Shannon's measure of information or diversity.
If guessing is at random, in which case the distractors have achieved their
maximum effectiveness,

$$p_2 = p_3 = \cdots = p_t$$

and H attains its maximum value. Its minimum value occurs when $p_2 = 1 - p_1$
and $p_3 = p_4 = \cdots = p_t = 0$, in which case guessing is as far away from being
random as it can be.

Wilcox (1981c) proposed another measure of how well the distractors
are performing. Labeled $\Delta$, it is just the difference between the maximum
possible value of $\tau$ (for fixed $\zeta$) and the actual value of $\tau$. An
illustration of the $\Delta$ measure is given in Wilcox (1982b).



The entropy function measures the extent to which $p_2, p_3, \ldots, p_t$
are unequal; the closer these probabilities are to being equal, the closer
is the item to the ideal situation where guessing is at random. H can
be estimated by replacing the $p_i$'s with $x_i/N$. This yields a maximum
likelihood estimate of H, say $\hat{H}$, but the exact distribution of $\hat{H}$ is
complex and cumbersome to work with, and an asymptotic approximation of the
distribution of $\hat{H}$ tends to be unsatisfactory unless N is very large
(Bowman et al., 1971). Accordingly, it might be convenient to have some
other index that measures the extent to which $p_2, \ldots, p_t$ are unequal.
It turns out that a whole family of functions exists that have properties
similar to H (Marshall & Olkin, 1979). One function that seems especially
convenient is Simpson's measure of diversity (Simpson, 1949). For the
situation at hand it is given by

$$S = \sum_{i=2}^{t} \left(\frac{p_i}{1 - p_1}\right)^2.$$

Note that random guessing can be tested by testing $p_2 = p_3 = \cdots = p_t$
(see Smith et al., 1981; Wilcox, 1982e). But if the null hypothesis
is rejected, the real question is how far away the item is from random
guessing, and the measures S and H answer this question.
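The two indices can be computed from answer-until-correct counts with a few lines of code. The sketch below assumes both measures are applied to the conditional proportions $p_i/(1 - p_1)$ for attempts 2 through t, estimated by $x_i/(N - x_1)$, and that S is the sum of the squared conditional proportions; the function names are hypothetical.

```python
from math import log
from typing import List, Sequence


def conditional_guess_proportions(attempt_counts: Sequence[int]) -> List[float]:
    """Estimate p_i / (1 - p_1) for i = 2, ..., t.

    attempt_counts[i - 1] is the number of examinees correct on attempt i.
    Only examinees who missed on the first attempt contribute.
    """
    later = list(attempt_counts[1:])
    total = sum(later)
    if total == 0:
        raise ValueError("no examinee needed more than one attempt")
    return [x / total for x in later]


def entropy_h(attempt_counts: Sequence[int]) -> float:
    """Plug-in estimate of the entropy H (terms with zero count contribute 0)."""
    props = conditional_guess_proportions(attempt_counts)
    return -sum(p * log(p) for p in props if p > 0)


def simpson_s(attempt_counts: Sequence[int]) -> float:
    """Plug-in estimate of Simpson's index: sum of squared proportions."""
    return sum(p * p for p in conditional_guess_proportions(attempt_counts))


if __name__ == "__main__":
    print(entropy_h([40, 20, 20, 20]), simpson_s([40, 20, 20, 20]))
    print(entropy_h([40, 60, 0, 0]), simpson_s([40, 60, 0, 0]))
```

Under this convention H is largest, and S smallest, when the later-attempt proportions are equal, so the two indices move in opposite directions as guessing departs from random.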

Alam and Mitra (1981) report some results on the distribution of

$$\sum_{i=1}^{t} (x_i/N)^2$$

which might be used to make inferences about S, but there is an error in
their results. Alam (1981) confirms the error, and a correction is in
preparation.



Testing Whether Items are Equivalent or Hierarchically Related 

The same model used to make inferences about $p_k$ can also be used
to test whether items are equivalent or hierarchically related. The
procedure can be briefly outlined as follows. For a randomly sampled
examinee responding to a pair of specific test items, let $\zeta_{ij}$ be the
probability of being able to eliminate i distractors from the first
item and j distractors from the second. The proportion of examinees who
know both items corresponds to $\zeta_{t-1,t-1}$, where t is still the number of
alternatives. If $p_{km}$ is the probability of a correct response on the
kth and mth attempt of the two items respectively, then under certain
mild independence assumptions,

$$p_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t-m} \frac{\zeta_{ij}}{(t - i)(t - j)}.$$

If the two items are hierarchically related, then some of the $\zeta_{ij}$'s
must equal zero, which in turn means that some of the $p_{km}$'s will be
equal to one another. An illustration is given in Wilcox (1982f).
These equalities can be tested in numerous ways, yielding an empirical
check on whether the items are hierarchically related.
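As an illustration of the two-item model, the sketch below evaluates the probability of being correct on attempt k of the first item and attempt m of the second from a t-by-t table of elimination probabilities. It assumes that an examinee who has eliminated i distractors guesses at random among the t - i remaining alternatives, independently on the two items, and that eliminating t - 1 distractors is the same as knowing the item; the function name and the table layout are hypothetical.

```python
from typing import Sequence


def joint_attempt_probability(zeta: Sequence[Sequence[float]], k: int, m: int) -> float:
    """Probability of a correct response on attempts k and m of two items.

    zeta[i][j] is the probability of being able to eliminate i distractors
    from the first item and j from the second, for i, j = 0, ..., t - 1,
    where t is the number of alternatives per item.
    """
    t = len(zeta)
    if not (1 <= k <= t and 1 <= m <= t):
        raise ValueError("attempt numbers must lie between 1 and t")
    total = 0.0
    for i in range(t - k + 1):        # attempt k is possible only if i <= t - k
        for j in range(t - m + 1):    # attempt m is possible only if j <= t - m
            total += zeta[i][j] / ((t - i) * (t - j))
    return total


if __name__ == "__main__":
    # Hypothetical pair of two-alternative items with equal cell probabilities.
    zeta = [[0.25, 0.25], [0.25, 0.25]]
    for k in (1, 2):
        print([joint_attempt_probability(zeta, k, m) for m in (1, 2)])
```

Setting selected cells of the table to zero, as a hierarchical relation between the items would require, shows directly which of the resulting probabilities coincide.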



Correcting for Type II Guessing

All of the applications that have been described are based on
what Wilcox (1981c) calls Type I guessing. This just means that
guessing is defined in terms of a randomly sampled examinee responding to a
[...] randomly sampled item. That is, an examinee's guessing ability is
the probability of giving a correct response to a typical test item
that he/she does not know. The situation is similar to the item
sampling models described earlier, except that guessing is taken into
account. Rather than estimating $\zeta$, an examinee's percent correct
true score, the goal is to estimate $\omega$, the proportion of items in the
item pool that the examinee knows.

It is a simple matter to adjust latent structure models developed
under Type I guessing to the problem of estimating $\omega$ (e.g., Wilcox,
1979b, 1981c, 1982a). Consider, for example, an answer-until-correct
test. If, for a specific examinee, $\omega_i$ is the probability that he/she
can eliminate i distractors from a randomly chosen item, then the probability
of getting an item correct on the first attempt is

$$q_1 = \omega + \sum_{i=0}^{t-2} \frac{\omega_i}{t - i}$$

and the probability of a correct response on the second attempt is

$$q_2 = \sum_{i=0}^{t-2} \frac{\omega_i}{t - i}$$

so

$$\omega = q_1 - q_2.$$

Thus, if on an n-item test there are $z_i$ items for which the examinee is
correct on his/her ith attempt, the estimate of $\omega$ is simply

$$(z_1 - z_2)/n.$$

Indeed, all of the results under Type I guessing are also available under
Type II guessing.
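For a single examinee the estimate can be computed directly from the record of how many attempts each item required. The sketch assumes the record is a list with one entry per item giving the attempt on which the correct alternative was chosen, and that the estimate is the difference between the proportions of items answered correctly on the first and second attempts; the function name is hypothetical.

```python
from typing import Sequence


def estimate_omega(attempts_needed: Sequence[int]) -> float:
    """Estimate the proportion of items an examinee knows as (z_1 - z_2) / n.

    attempts_needed[i] is the attempt (1, 2, ..., t) on which the examinee
    gave the correct response to item i under answer-until-correct scoring.
    """
    n_items = len(attempts_needed)
    if n_items == 0:
        raise ValueError("the test must contain at least one item")
    z1 = sum(1 for a in attempts_needed if a == 1)
    z2 = sum(1 for a in attempts_needed if a == 2)
    return (z1 - z2) / n_items


if __name__ == "__main__":
    record = [1, 1, 1, 2, 3, 1, 2, 1, 4, 1]  # hypothetical 10-item record
    print(estimate_omega(record))
```

The estimate can be negative for a short test when second-attempt successes outnumber first-attempt successes; the sketch returns the raw value and leaves any truncation at zero to the user.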



Should interest be directed toward determining which (or how many)
of the n items on a test an examinee knows, or toward estimating $\omega$,
or both? Macready and Dayton (1977, 1980) argue that at least in some
situations, the former goal should be sought, and that perhaps formulating
the goal of a test in terms of $\omega$ should be avoided. It would seem that
the solution to this problem will depend on what exactly an investigator
wants to determine, and of course this will vary from situation to
situation.

An advantage of estimating $\omega$ with an answer-until-correct scoring
procedure is that it can substantially reduce the problems noted by
van den Brink and Koele (1980) and Wilcox (1980b) when trying to
determine whether $\omega$ is above or below some known constant. This is one of
the problems mentioned at the beginning of this section. In situations
where an answer-until-correct scoring procedure can be used, there are
now two related solutions that might be adopted (Wilcox, 19820, 
1982d). The former approach is particularly well suited for
computerized testing where a sequential scoring rule can be used.

Strong True Score Models

As previously indicated, the Type II guessing model under the
answer-until-correct procedure implies that $\omega$, the proportion of items an
examinee knows, is equal to $q_1 - q_2$, where $q_i$ is the probability of a
correct response on the ith attempt of a randomly selected item. Under
a conventional scoring procedure where an examinee gets only one attempt
at an item, $q_1$ is the [...]. If for a
population of examinees the distribution of $q_1$ could be determined,
many practical measurement problems could be solved (Lord, 1965; Lord
& Novick, 1968, ch. 23; Wilcox, 1981a). The most frequently used approach
when estimating this distribution, say $g(q_1)$, is to assume that

$$g(q_1) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)} q_1^{r-1} (1 - q_1)^{s-1}, \qquad (5.1)$$

the beta density with parameters r > 0 and s > 0, where $\Gamma$ is the
gamma function. Empirical studies cited by Wilcox (1981a) indicate
that (5.1) will frequently give good results when addressing various
measurement problems.

Is it possible to develop a similar strong true score model that
takes into account the guessing ability of the examinees? Wilcox
(1981a) summarized results on several models that have been proposed,
and so they will not be discussed here. The important point is that all
of the strong true score models reviewed by Wilcox (1981a) now appear to
be totally inadequate for both theoretical and empirical reasons. Some
of these models were based on the assumption that guessing is at random,
but recent empirical investigations indicate that this is highly
unsatisfactory (Wilcox, 1982a, 1982b). See also Bliss (1980) and Cross and
Frary (1977). Other models were based on a multivariate analog of the
beta-binomial distribution (the Dirichlet-multinomial) which allowed $\beta$ to
vary over the population of examinees. This model implies that $\omega$ and $\beta$
are independent (Wilcox, 1981b), but this appears to be an unsatisfactory
assumption because the model gives a very poor fit to data.

Coombs et al. (1956) suggested that an examinee's guessing ability
increases with the proportion of items he/she knows. Wilcox (1982a)
proposed a strong true score model based on this assumption and an
answer-until-correct scoring procedure under Type II guessing. Among the
several models that were considered, this was the only model that gave
a reasonable fit to the data. A more recent empirical study got similar
results (Wilcox, 1982b).

The model assumes (5.1) holds, and as already mentioned, this
frequently gives good results with real data. Let $\gamma = q_2/(1 - q_1)$.
The model also assumes that $\gamma$ can be written as an increasing function of
$q_1$ which is given by

$$\gamma(q_1) = c \int_0^{q_1} \frac{\Gamma(r_1 + s_1)}{\Gamma(r_1)\Gamma(s_1)} u^{r_1 - 1} (1 - u)^{s_1 - 1} \, du + (t - 1)^{-1}$$

where c, $r_1 > 0$, and $s_1 > 0$ are unknown parameters that are estimated
from observed test scores. (The subscripts on r and s are used to
distinguish them from the parameters r and s used earlier.) A method
of estimating c, $r_1$, and $s_1$ is described by Wilcox (1982a).

This model can be used to solve many measurement problems that were
previously impossible to solve. For example, suppose a conventional test
is administered, and it is desired to correct for guessing without assuming
guessing is at random. If the function $\gamma(q_1)$ has been previously estimated,
then, because $q_2 = \gamma(q_1)(1 - q_1)$ and $\omega = q_1 - q_2$,

$$\omega = q_1 - \gamma(q_1)(1 - q_1).$$

If $\gamma$ is arbitrarily set equal to $(t - 1)^{-1}$, the
usual correction for guessing formula score results.
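The relation between the corrected score and the usual formula score can be checked numerically. The sketch below assumes the correction takes the form $\omega = q_1 - \gamma (1 - q_1)$, which follows from $\gamma = q_2/(1 - q_1)$ and $\omega = q_1 - q_2$; the value of $\gamma$ is supplied by the caller, since estimating the function $\gamma(q_1)$ is outside the scope of the sketch, and the function names are hypothetical.

```python
def corrected_score(q1: float, gamma: float) -> float:
    """Proportion of items known implied by q1 and a guessing rate gamma.

    q1    -- probability (or observed proportion) correct on the first attempt
    gamma -- probability of a correct second attempt given a wrong first one
    """
    if not 0.0 <= q1 <= 1.0:
        raise ValueError("q1 must be a proportion")
    return q1 - gamma * (1.0 - q1)


def formula_score(q1: float, t: int) -> float:
    """Usual correction for guessing: gamma fixed at 1 / (t - 1)."""
    if t < 2:
        raise ValueError("an item needs at least two alternatives")
    return corrected_score(q1, 1.0 / (t - 1))


if __name__ == "__main__":
    # Hypothetical examinee: 70 percent correct on four-alternative items.
    print(formula_score(0.7, 4))
    print(corrected_score(0.7, 0.5))  # a better-than-random guesser
```

A guessing rate above 1/(t - 1) lowers the corrected score relative to the formula score, which is the direction expected if examinees with partial information guess better than chance.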

It should be mentioned that while it is possible to correct for
guessing under the answer-until-correct procedure, alternative scoring
rules might be preferred (Brown, 1965; Dalrymple-Alford, 1970). These
scoring formulae do not estimate $\omega$, but instead give an examinee credit
for having partial information. Whether this is desirable will depend
on the examiner's goal. Of course, several other scoring procedures have
been proposed, some of which are discussed by Frary (1980). The important
point is that none of these rules yields an estimate of $\omega$. The same is
true of the procedure proposed by Gibbons, Olkin, and Sobel (1979), and
the rule suggested by Austin (1981). Note that Austin's procedure is the
same as one proposed by Arnold and Arnold (1970), which is discussed by
Frary (1980).

Additional Applications

Several other applications of latent class models have been
examined in the literature which are only mentioned here. These include a
tailored testing procedure (Wilcox, in press b) that might be used
when computerized testing is feasible. Knapp (1977) discusses
a reliability coefficient that is based on a latent state point of view,
and Emrick (1971) describes how these models might be used to determine
the passing score of a criterion-referenced test. Emrick's estimation
procedure was shown to be incorrect (Wilcox & Harris, 1977), but this
problem is easily corrected using one of the estimation procedures
already described. A closed form estimate of the parameters in Emrick's
model is given by van der Linden (1981).

6. Further Comments on How Latent Class and Latent Trait
Models are Related

In the three parameter latent trait model given by equation (2.4),
the parameter c is sometimes called a guessing parameter. Hopefully by
this point it can be seen that this parameter has nothing to do with the
notion of guessing used in latent class models. The parameter c is just
$\lim_{\theta \to -\infty} p(\theta)$. Thus, c refers to the probability of a correct response to
an item for a particular type of examinee, namely, examinees for whom $\theta$
is small. For latent class models guessing is defined in terms of a
specific item and a population of examinees who do not know, or a specific
examinee and a domain of items that he/she does not know; this is
different from the population of examinees having $\theta$ small. Suppose for example
$p(\theta) = 1/2$. Using the item sampling interpretation of $p(\theta)$, this means
that among all the items having item parameters a, b, and c, the
probability of a correct response is 1/2 for an examinee with ability level
$\theta$. But this suggests that the examinee does not know all of these items,
in which case some answers will be correct by chance. But how does the
parameter c correct this difficulty? The answer is that it doesn't deal
with this problem at all.

Some writers have interpreted $p(\theta)$ in (2.4) as the probability of
knowing an item, which suggests that latent trait models might be related to
latent class models, but no simple relationship has been established when
errors at the item level exist because the models measure different things.
In fact, if this interpretation is used, all estimates of the item
parameters in (2.4) break down when multiple-choice items are used. To see
this, note that in order to estimate a, b, and c, it would be necessary to
determine which items (or how many items) an examinee knows. But what is
observed is only which items were answered correctly. In some cases perhaps
this is not a serious problem; it seems that more work is needed in this
area. Mislevy and Bock (1982), as well as Wainer and Wright (1980), have
given some attention to the problem of estimating latent trait parameters
in the presence of guessing. However, the model they used for guessing
behavior is different from the notion of guessing in latent class models.

To further differentiate the two models, perhaps a more general
theoretical description of true score models will help. Sirotnik and
Wilcox (1982) point out that certain notions in Torgerson (1958) can be
used to describe a model that contains as a special case all the true
score models described in this paper. Their developments are briefly
summarized here.

Let $\psi$ be some "ability" parameter that characterizes an examinee.
For a randomly sampled examinee, let $p_i(\psi)$ be the conditional probability
of a correct response to the ith item on an n-item test given that the
examinee has ability $\psi$. Let $p_x(\psi)$ be the conditional probability (given $\psi$)
of x correct responses, and let $g(\psi)$ be the probability density function
of $\psi$ for the population of examinees. Then

$$p_i = \int p_i(\psi) g(\psi) \, d\psi$$

is the probability of a correct response to the ith item for a randomly
sampled examinee.

A basic problem is determining what $\psi$ should represent. For a latent
class model, the simplest case is for a single examinee and a single item,
in which case the only two possible values for $\psi$ are 1 (the examinee knows)
or 0 (the examinee does not know). Then $g(1)$ is the proportion of examinees
who know. For the AUC model the possible values of $\psi$ are 0, 1, ..., t-1, and
$p_i(\psi) = (t - \psi)^{-1}$ for a randomly sampled examinee. Note that for these models,
an examinee's ability is defined in terms of a specific item, and this can
be used as a basis for defining ability in terms of the number of items
known on an n-item test, or the proportion of items known in an item domain.
For latent trait models $\psi$ does not indicate what an examinee knows, but
rather, it determines the probability of knowing when there are no errors
at the item level such as guessing. Another important point is that to
say the item parameter c is the same as the guessing parameter in the AUC
model is to somehow equate c to $p_i(\psi)$ given that $\psi < t-1$.

For an item sampling model based on a latent class model, $\psi$ is the
proportion of items in an item domain that an examinee knows, $0 \le \psi \le 1$, and
$p_x(\psi) = \binom{n}{x} \psi^x (1 - \psi)^{n-x}$. In latent trait models, the probability of a correct
response to the ith item depends on a, b, and c. Thus, as previously
pointed out, for latent trait models, $p_i(\psi) = E_{abc}(U)$, where $E_{abc}$
means expectation with respect to a, b, and c. Also

$$p_i = \int\!\!\int\!\!\int\!\!\int p_i(\psi) g(\psi, a, b, c) \, d\psi \, da \, db \, dc$$

where $g(\psi, a, b, c)$ is the joint density of $\psi$, a, b, and c.


7. Concluding Remarks

As was stressed at the beginning of the paper, it is not being argued
that the other approaches to measurement (classical test theory, latent
trait models, and item sampling models) be abandoned, or that they are
intrinsically bad in any sense. It is being argued, though, that careful
examination of the goal of a test should be made before a true score
model is chosen. Generally, different models give different solutions to
the same problem. For example, when determining how many distractors
should be used, latent trait models can be applied (Lord, 1980), but the
criterion used is different from the one used in latent state models.



Another reason for choosing a model carefully is that some writers
have argued that latent trait models do not address many of the
measurement problems that are currently of interest (e.g., Baker, 1977). The
primary point in this paper is that latent class models give the test
constructor ways of examining measurement problems that did not exist a
short while ago. By using latent class models in conjunction with other
true score models, tests can be analyzed in a more effective manner than
ever before.






REFERENCES

Alam, K. Personal communication, 1981.

Alam, K., & Mitra, A. Polarization test for the multinomial distribution.
Journal of the American Statistical Association, 1981, 76, 107-109.

Arnold, J. C., & Arnold, P. L. On scoring multiple-choice exams allowing
for partial knowledge. The Journal of Experimental Education, 1970,
39, 8-13.

Ashler, D. Biserial estimators in the presence of guessing. Journal of
Educational Statistics, 1979, 4, 325-355.

Austin, J. D. Grading distractor-identification tests. Psychometrika,
1981, 46, 129-138.

Baker, F. B. Advances in item analysis. Review of Educational Research,
1977, 47, 151-178.

Baker, F. B., & Hubert, L. J. Inference procedures for ordering theory.
Journal of Educational Statistics, 1977, 2, 217-233.

Bergan, J. R., Cancelli, A. A., & Luiten, J. W. Mastery assessment with
latent class and quasi-independence models representing homogeneous
item domains. Journal of Educational Statistics, 1980, 5, 65-81.

Birnbaum, A. Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical
theories of mental test scores. Reading, Mass.: Addison-Wesley,
1968.

Bliss, L. B. A test of Lord's assumption regarding examinee guessing
behavior on multiple-choice tests using elementary school students.
Journal of Educational Measurement, 1980, 17, 147-153.






Bowman, K., Hutcheson, K., Odum, E., & Shenton, L. Comments on the
distribution of indices of diversity. In G. Patil, E. Pielou, &
W. Waters (Eds.), International Symposium on Statistical Ecology
(Vol. 3). University Park: Pennsylvania State Press, 1971.

Brown, J. Multiple response evaluation of discrimination. The British
Journal of Mathematical and Statistical Psychology, 1965, 18, 125-137.

Brownless, V. T., & Keats, J. A. A retest method of studying partial
knowledge and other factors influencing item response. Psychometrika,
1958, 23, 67-73.

Chapman, J. W. A comparison of the $X^2$, $-2\log R$ and multinomial probability
criteria for significance tests when expected frequencies are small.
Journal of the American Statistical Association, 1976, 71, 854-863.

Cliff, N. A theory of consistency of ordering generalizable to tailored
testing. Psychometrika, 1977, 42, 375-399.

Coombs, C. H., Milholland, J. E., & Womer, F. B. The assessment of
partial information. Educational and Psychological Measurement,
1956, 16, 13-37.

Cox, R. C., & Graham, G. T. The development of a sequentially scaled
achievement test. Journal of Educational Measurement, 1966, 3,
147-150.

Cross, L. H., & Frary, R. B. An empirical test of Lord's theoretical
results regarding formula-scoring of multiple-choice tests. Journal
of Educational Measurement, 1977, 14, 313-321.

Dalrymple-Alford, E. C. A model for assessing multiple-choice test
performance. British Journal of Mathematical and Statistical
Psychology, 1970, 23, 199-203.




Dayton, C. M., & Macready, G. B. A probabilistic model for validation
of behavioral hierarchies. Psychometrika, 1976, 41, 189-204.

Dayton, C. M., & Macready, G. B. A scaling model with response errors
and intrinsically unscalable respondents. Psychometrika, 1980, 45,
343-356.

Diamond, J., & Evans, W. The correction for guessing. Review of
Educational Research, 1973, 43, 181-191.

Duncan, G. J. An empirical Bayes approach to scoring multiple-choice
tests in the misinformation model. Journal of the American Statistical
Association, 1974, 69, 50-57.

Emrick, J. A. An evaluation model for mastery testing. Journal of
Educational Measurement, 1971, 8, 321-326.

Frary, R. B. The effect of misinformation, partial information, and
guessing on expected multiple-choice test item scores. Applied
Psychological Measurement, 1980, 4, 79-90.

Gagne, R. M., & Paradise, N. E. Abilities and learning sets in knowledge
acquisition. Psychological Monographs, 1961, 75, 1-23.

Gagne, R. M. Learning hierarchies. Educational Psychologist, 1968,
6, 1-9.

Gibbons, J. D., Olkin, I., & Sobel, M. A subset selection technique
for scoring items on a multiple choice test. Psychometrika, 1979,
44, 259-270.

Gibson, W. A. Three multivariate models: factor analysis, latent
structure analysis, and latent profile analysis. Psychometrika,
1959, 24, 229-252.

Gibson, W. A. Extending latent class solutions to other variables.
Psychometrika, 1962, 27, 73-81.






Gilula, Z. Singular value decomposition of probability matrices:
Probabilistic aspects of latent dichotomous variables.
Biometrika, 1979, 66, 339-344.

Goodman, L. A. Exploratory latent structure analysis using both
identifiable and unidentifiable models. Biometrika, 1974, 61,
215-231.

Goodman, L. A. On the estimation of parameters in latent structure
analysis. Psychometrika, 1979, 44, 123-128.

Green, B. F. A general solution for the latent class model of latent
structure analysis. Psychometrika, 1951, 16, 151-166.

Haberman, S. J. Product models for frequency tables involving indirect
observation. The Annals of Statistics, 1977, 5, 1124-1147.

Hambleton, R. K., Swaminathan, H., Cook, L., Eignor, D. R., & Gifford,
J. Developments in latent trait theory: models, technical issues,
and applications. Review of Educational Research, 1978, 48, 467-510.

Harnisch, D. L., & Linn, R. L. Analysis of item response patterns:
Questionable test data and dissimilar curriculum practices. Journal
of Educational Measurement, 1981, 18, 133-146.

Harris, C. W., & Pearlman, A. An index for a domain of completion or
short answer items. Journal of Educational Statistics, 1978, 3,
285-304.

Harris, C. W., Houang, R. T., Pearlman, A. P., & Barnett, B. Final
report submitted to the National Institute of Education, Grant
No. NIE-G-78-0085, Project No. 8-0244, 1980.

Hartke, A. R. The use of latent partition analysis to identify
homogeneity of an item population. Journal of Educational Measurement,
1978, 15, 43-47.



Hilke, R., Kempf, W. F., & Scandura, J. M. Deterministic and probabilistic
theorizing in structural learning. In H. Spada and F. Kempf (Eds.),
Structural models of thinking and learning. Bern: Hans Huber, 1977.

Holland, P. W. When are item response models consistent with observed
data? Psychometrika, 1981, 46, 79-92.

Horst, P. The difficulty of a multiple choice test item. Journal of
Educational Psychology, 1933, 24, 229-232.

Hutchinson, T. P. Some theories of performance in multiple choice tests,
and their implications for variants of the task. British Journal of
Mathematical and Statistical Psychology, 1982, 35, 71-89.

Huynh, H. On the reliability of decisions in domain-oriented testing.
Journal of Educational Measurement, 1976, 13, 253-264.

Kale, B. K. A note on a problem in estimation. Biometrika, 1962, 49,
553-557.

Keesling, J. W. Empirical validation of criterion-referenced measures.
In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in
criterion-referenced measurement. Los Angeles: Center for the Study
of Evaluation, monograph no. 3, 1974.

Knapp, T. R. The reliability of a dichotomous test-item: A
"correlationless" approach. Journal of Educational Measurement,
1977, 14, 237-252.

Koehler, K. J., & Larntz, K. An empirical investigation of goodness-
of-fit statistics for sparse multinomials. Journal of the American
Statistical Association, 1980, 75, 333-342.

Lazarsfeld, P. F. The logical and mathematical foundation of latent
structure analysis. In S. A. Stouffer et al. (Eds.), Measurement
and prediction. Princeton: Princeton University Press, 1950.



Lazarsfeld, P. F., & Henry, N. W. Latent structure analysis. New
York: Houghton Mifflin, 1968.

Lord, F. M. An approach to mental test theory. Psychometrika, 1959,
24, 283-302.

Lord, F. M. A strong true-score theory, with applications. Psychometrika,
1965, 30, 239-270.

Lord, F. M. Individualized testing and item characteristic curve theory.
In D. H. Krantz, R. C. Atkinson, R. D. Luce, & P. Suppes (Eds.),
Contemporary developments in mathematical psychology (Vol. 2). San
Francisco: Freeman, 1974.

Lord, F. M. Applications of item response theory to practical testing
problems. Hillsdale, New Jersey: Erlbaum, 1980.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Macready, G. B., & Dayton, C. M. The use of probabilistic models in the
assessment of mastery. Journal of Educational Statistics, 1977, 2,
99-120.

Macready, G. B., & Dayton, C. M. A two-stage conditional estimation
procedure for unrestricted latent class models. Journal of
Educational Statistics, 1980, 5, 129-156. (a)

Macready, G. B., & Dayton, C. M. The nature and use of state mastery
models. Applied Psychological Measurement, 1980, 4, 493-516. (b)

Marks, E., & Noll, G. A. Procedures and criteria for evaluating reading
and listening comprehension tests. Educational and Psychological
Measurement, 1967, 27, 335-348.

Marshall, A., & Olkin, I. Inequalities: Theory of majorization and its
application. New York: Academic Press, 1979.

McDonald, R. P. The dimensionality of tests and items. The British




McHugh, R. B. Efficient estimation and local identification in latent
class analysis. Psychometrika, 1956, 21, 331-347.

McNemar, Q. Note on the sampling error of the differences between
correlated proportions or percentages. Psychometrika, 1947, 12,
153-157.

Mellenbergh, G. J., & Vijn, P. The Rasch model as a loglinear model.
Applied Psychological Measurement, 1981, 5, 369-376.

Meskauskas, J. A. Evaluation models for criterion-referenced testing:
Views regarding mastery and standard setting. Review of Educational
Research, 1976, 46, 133-158.

Mislevy, R. J., & Bock, R. D. Biweight estimates of latent ability.
Educational and Psychological Measurement, 1982, 42, 725-738.

Molenaar, I. On Wilcox's latent structure model for guessing. British
Journal of Mathematical and Statistical Psychology, 1981, 34, in
press.

Proctor, C. H. A probabilistic formulation and statistical analysis of
Guttman scales. Psychometrika, 1970, 35, 73-78.

Rao, C. R. Linear statistical inference and its application. New York:
Wiley, 1973.

Reulecke, A. A statistical analysis of deterministic theories. In
H. Spada & F. Kempf (Eds.), Structural models of thinking and learning.
Bern: Hans Huber, 1977.

Robertson, T. Testing for and against an order restriction on multinomial
parameters. Journal of the American Statistical Association, 1978,
73, 197-202.




Scandura, J. M. Deterministic theorizing in structural learning. Journal
of Structural Learning, 1971, 3, 21-53.

Scandura, J. M. Structural learning: Theory and research. New York:
Gordon and Breach, 1973.

Simpson, E. Measurement of diversity. Nature, 1949, 163, 688.

Sirotnik, R., & Wilcox, R. Realizing the potential of latent structure
analysis for integrating and differentiating extant true score/
latent ability measurement models. Center for the Study of Evaluation,
University of California, Los Angeles, 1982.

Smith, P. J., Rae, D. S., Manderscheid, R. W., & Silbergeld, S. Approximating
the moments and distribution of the likelihood ratio statistic
for multinomial goodness of fit. Journal of the American Statistical
Association, 1981, 76, 737-740.

Spada, H. Logistic models of learning and thought. In H. Spada and F.
Kempf (Eds.), Structural models of thinking and learning. Bern:
Hans Huber, 1977.

Stouffer, S. A. Measurement and prediction. Princeton: Princeton University
Press, 1950.

Subkoviak, M. J. Estimating reliability from a single administration of
a criterion-referenced test. Journal of Educational Measurement, 1976,
13, 265-276.

Van den Brink, W. P., & Koele, P. Item sampling, guessing and decision-
making in achievement testing. British Journal of Mathematical and
Statistical Psychology, 1980, 33, 104-108.

Van der Linden, W. Forgetting, guessing, and mastery: the Macready and
Dayton model revisited and compared with a latent trait approach.
Journal of Educational Statistics, 1978, 3, 305-317.



Van der Linden, W. Estimating the parameters of Emrick's mastery
testing model. Applied Psychological Measurement, 1981, 5, to appear.

Wainer, H., Morgan, A., & Gustafsson, J. A review of estimation
procedures for the Rasch model with an eye toward longish tests.
Journal of Educational Statistics, 1980, 5, 35-64.

Wainer, H., & Wright, B. D. Robust estimation of ability in the Rasch
model. Psychometrika, 1980, 45, 373-391.

Weiss, D. J., & Davison, M. L. Review of test theory and methods.
Annual Review of Psychology, 1981, 32, 629-658.

Weitzman, R. A. Ideal multiple-choice items. Journal of the American
Statistical Association, 1970, 65, 71-89.

Werts, C. E., Linn, R. L., & Joreskog, K. G. A congeneric model for platonic
true scores. Educational and Psychological Measurement, 1973, 33,
311-318.

White, R. T., & Clark, R. M. A test of inclusion which allows for
errors of measurement. Psychometrika, 1973, 38, 77-86.

Wilcox, R. R. New methods for studying stability. In C. W. Harris, A.
Pearlman, & R. Wilcox, Achievement test items: Methods of study.
CSE Monograph No. 6, Los Angeles: Center for the Study of Evaluation,
University of California, 1977. (a)

Wilcox, R. R. New methods for studying equivalence. In C. W. Harris,
A. Pearlman, & R. Wilcox, Achievement test items: Methods of study.
CSE Monograph No. 6, Los Angeles: Center for the Study of Evaluation,
University of California, 1977. (b)

Wilcox, R. R. Prediction analysis and the reliability of a mastery test.
Educational and Psychological Measurement, 1979, 39, 825-839. (a)






Wilcox, R. R. An alternative interpretation of three stability models.
Educational and Psychological Measurement, 1979, 39, 311-316. (b)

Wilcox, R. R. Some results and comments on using latent structure
models to measure achievement. Educational and Psychological
Measurement, 1980, 40, 645-658. (a)

Wilcox, R. R. Determining the length of a criterion-referenced test.
Applied Psychological Measurement, 1980, 4, 425-446. (b)

Wilcox, R. R. A review of the beta-binomial model and its extensions.
Journal of Educational Statistics, 1981, 6, 3-32. (a)

Wilcox, R. R. Recent advances in measuring achievement: A response to
Molenaar. British Journal of Mathematical and Statistical Psychology,
1981, 34, 229-237. (b)

Wilcox, R. R. Solving measurement problems with an answer-until-correct
scoring procedure. Applied Psychological Measurement, 1981, 5, 399-414. (c)

Wilcox, R. R. A closed sequential procedure for comparing the binomial
distribution to a standard. British Journal of Mathematical and
Statistical Psychology, 1981, 34, 238-242.

Wilcox, R. R. Some empirical and theoretical results on an answer-until-
correct scoring procedure. British Journal of Mathematical and
Statistical Psychology, 1982, 35, 57-70. (a)

Wilcox, R. R. Some new results on an answer-until-correct scoring
procedure. Journal of Educational Measurement, 1982, 19, 67-74. (b)

Wilcox, R. R. Using results on k out of n system reliability to study
and characterize tests. Educational and Psychological Measurement,
1982, 42, 153-165. (c)






Wilcox, R. R. Determining the length of multiple-choice criterion-referenced
tests when an answer-until-correct scoring procedure is used.
Educational and Psychological Measurement, 1982, 42, 789-794. (d)

Wilcox, R. R. A comment on approximating the $\chi^2$ distribution in the
equiprobable case. Communications in Statistics - Simulation and
Computation, 1982, 11, 619-623. (e)

Wilcox, R. R. Bounds on the k out of n reliability of a test, and an
exact test for hierarchically related items. Applied Psychological
Measurement, 1982, 6, 327-336. (f)

Wilcox, R. R. A closed sequential procedure for answer-until-correct
tests. Journal of Experimental Education, 1982, 50, 219-222. (g)

Wilcox, R. R. How do examinees behave when taking a multiple-choice
test? Applied Psychological Measurement, in press. (a)

Wilcox, R. R. An approximation of the k out of n reliability of a test
and a scoring procedure for determining which items an examinee
knows. Psychometrika, in press. (b)

Wilcox, R. R., & Harris, C. W. On Emrick's "An evaluation model for
mastery testing." Journal of Educational Measurement, 1977, 14,
215-218.

Wright, B. D. Solving measurement problems with the Rasch model.
Journal of Educational Measurement, 1977, 14, 97-116.



STRONG TRUE-SCORE THEORY 



Rand R. Wilcox
Department of Psychology 
University of Southern California 
and 

Center for the Study of Evaluation 
University of California, Los Angeles 




In mental test theory a general goal is to use observed
test scores to make inferences about an unknown parameter
$\theta$ that represents an examinee's ability in a certain
area such as arithmetic reasoning, vocabulary, spatial
ability, etc. The parameter $\theta$ is frequently called an
examinee's true score. There are several types of true
scores [3], but because of space restrictions the differences
among them are not discussed. True score models are just
probability models that yield methods for estimating $\theta$
or making inferences about the characteristics of a test.
The term strong true-score theory was introduced by Lord
[2] to make a distinction between "weak" theories that
cannot be contradicted by data, and "strong" theories where
assumptions are made about the distribution of observed
test scores. Strictly speaking latent trait models (also
known as item response theories) fall within this definition,
but the term strong true-score model is usually reserved
for models based on the binomial probability function
or some related distribution. Apparently this is because
the main focus of Lord's paper was a model based on the
binomial probability function.

Consider a single examinee responding to n dichotomously
scored items. As just indicated the best known
strong true-score model assumes that the probability of x
correct responses is given by






$$f(x \mid \theta) = \binom{n}{x}\,\theta^{x}(1-\theta)^{n-x}. \tag{1}$$

In addition to specifying a probability function for x, an
examinee's observed score, strong true-score models typically
specify a particular family of distributions for $\theta$
over the population of examinees. When (1) is assumed the
family of beta densities is commonly used where $g(\theta)$, the
probability density function of $\theta$, is given by

$$g(\theta) = \frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)}\,\theta^{r-1}(1-\theta)^{s-1}$$

and where r, s>0 are unknown parameters. Estimates of the
parameters r and s are easily obtained with the method of
moments [6] and maximum likelihood estimates are available
from [1]. Basically the beta-binomial model falls within
the realm of empirical Bayesian techniques, as do most
strong true-score models. The beta-binomial model frequently
gives a good fit to data, and it provides a solution to
many measurement problems [6]. Included are methods of
equating tests and methods of estimating test accuracy and
reliability.
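
As an illustration of the beta-binomial model just described, the following is a minimal sketch, not the procedure of [6] itself. It assumes the standard result that a binomial observed score with a beta-distributed true score has a beta-binomial marginal distribution, and it matches the first two moments of that distribution to recover r and s. The function names are hypothetical, and in practice the population mean and variance would be replaced by sample values.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, n, r, s):
    """Marginal probability of x correct out of n items when the true
    score is binomial given theta and theta has a beta(r, s) density."""
    return comb(n, x) * exp(log_beta(x + r, n - x + s) - log_beta(r, s))


def moment_estimates(mean, var, n):
    """Moment-matching estimates of r and s from the mean and variance
    of observed scores on an n-item test (assumes var exceeds the
    binomial variance, so that q > 1 below)."""
    pi = mean / n                       # mean true score, r / (r + s)
    q = var / (n * pi * (1.0 - pi))     # equals (r + s + n) / (r + s + 1)
    total = (n - q) / (q - 1.0)         # r + s
    return pi * total, (1.0 - pi) * total


if __name__ == "__main__":
    n, r, s = 10, 2.0, 3.0
    probs = [beta_binomial_pmf(x, n, r, s) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    print(moment_estimates(mean, var, n))
```

Working on the log scale for the beta function keeps the calculation stable for long tests, where the gamma functions themselves would overflow.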

Several objections have been raised against the beta-binomial
model, but from [6] the only objection that seems
to have practical importance is that the model ignores
guessing. Here a correct guess refers to the event of a
correct response to a randomly sampled item that the examinee
does not know. For a strong true-score model where a
correct guess is defined in terms of randomly sampled examinees
(and where items are fixed), see [12].

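Suppose every item has t alternatives, and for a specific
examinee let $\zeta$ be the probability of knowing a randomly
sampled item. Morrison and Brockway [4] assumed random
guessing in which case

$$\theta = \zeta + t^{-1}(1-\zeta)$$

and the density of $\theta$ is

$$g(\theta) = \frac{t}{t-1}\,g_{\zeta}\!\left(\frac{t\theta-1}{t-1}\right), \qquad t^{-1} \le \theta \le 1,$$

where $g_{\zeta}$ denotes the density of $\zeta$.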
Unfortunately it appears that the random guessing assumption
is unsatisfactory. The only model that has given
good results is one proposed by Wilcox [8, 9] that is based
on an answer-until-correct scoring procedure and the assumption
that an examinee's guessing ability is a monotonic
function of $\theta$. By an answer-until-correct scoring procedure
is meant that an examinee chooses responses to a multiple-choice
test item until the correct alternative is chosen.
These tests are usually administered by having an examinee
erase a shield on especially designed answer sheets. Under
the shield is a letter indicating whether the correct
answer was chosen. If not, another shield is erased, and
the process continues until the correct alternative is selected.
Let $\zeta_i$ be the probability that an examinee can eliminate
i distractors from a randomly sampled item, $i=0,1,\ldots,t-1$.
It is assumed that when an examinee does not know,
there is at least one distractor that cannot be eliminated
through partial information, and so $\zeta_{t-1} = \zeta$. It is also
assumed that an examinee eliminates as many distractors as
possible, and then guesses at random from among the alternatives
that remain. For empirical evidence in support of
this last assumption, see [11]. If $\theta_i$ is the probability
of a correct answer on the ith try of a randomly sampled
item, then

$$\theta_i = \sum_{j=0}^{t-i} \frac{\zeta_j}{t-j}, \qquad i=1,\ldots,t,$$

and so the $\zeta_i$'s can be estimated. If $x_i$ is the number of
items requiring i attempts, it is assumed that the $x_i$'s
have a multinomial probability function. It is also
assumed that $\theta$ has a beta density with parameters r and s,
and that

EC-jrf-ie^'c/Q 1 hCu)du + t 1 ' . • (3J. . 

where c is an unknown parameter, and h(u) is also a beta
density but with parameters a and b. The model implies that

and so the lower limit for the integral in (3) should be
$t^{-1}$, but this modification has not yet been applied to real
data. Equation (3) is based on the assumption that the more
items an examinee knows, the higher the probability will be
that an examinee will give a correct guess to an item that
is not known. The parameters b and c are currently estimated
using what is basically the method of moments. The
details are too lengthy to report here; the interested
reader is referred to [8].
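
The step from the probabilities of a correct answer on each attempt to the elimination probabilities, described earlier in this section, can be sketched as follows. This is an illustration rather than the estimation procedure of [8]. It assumes only the two assumptions stated above: an examinee eliminates j distractors with probability $\zeta_j$ and then guesses at random among the $t-j$ alternatives that remain, so the probability of success on attempt i is the sum of $\zeta_j/(t-j)$ over $j=0,\ldots,t-i$. The function names are hypothetical.

```python
def attempt_probabilities(zeta):
    """Probability of a correct answer on attempt i (i = 1..t), given
    zeta[j] = probability of eliminating j distractors (j = 0..t-1)."""
    t = len(zeta)
    return [
        sum(zeta[j] / (t - j) for j in range(t - i + 1))
        for i in range(1, t + 1)
    ]


def recover_zeta(p):
    """Invert the relation above. Successive attempt probabilities differ
    by one term, which gives zeta[t-i] = i * (p_i - p_{i+1}), with the
    convention p_{t+1} = 0."""
    t = len(p)
    padded = list(p) + [0.0]
    zeta = [0.0] * t
    for i in range(1, t + 1):
        zeta[t - i] = i * (padded[i - 1] - padded[i])
    return zeta


if __name__ == "__main__":
    zeta = [0.1, 0.2, 0.3, 0.4]   # illustrative values for t = 4
    p = attempt_probabilities(zeta)
    print(p)
    print(recover_zeta(p))
```

In practice the attempt probabilities would be replaced by the observed proportions of items requiring i attempts, which is what makes the $\zeta_i$'s estimable from answer-until-correct data.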

* "i if - ". 

As a final note, there are now extensions of strong
true-score models based on closed sequential sampling techniques
which might be useful in computerized testing. By
closed sequential sampling is meant that items are
sampled and administered until some criterion is met. The
criterion actually used will depend on the purpose of the
test.

» • Hf ■ . 

Consider, for example, a criterion-referenced test
where the goal is to determine whether $\theta>\theta_0$ where $\theta_0$ is a
known constant. Suppose $\theta>\theta_0$ is decided if and only if
$x \ge c$, where c (a positive integer) is some known passing
score. Given that $\theta>\theta_0$ (or that $\theta<\theta_0$), the probability of
a correct decision is available immediately (given $\theta$) if
the binomial model is assumed. For related results, see [16].

Suppose instead that items are randomly sampled until
an examinee gets c items correct or $m=n-c+1$ items wrong.
Let x (y) be the number of correct (incorrect) responses when
the sampling of items terminates. The joint probability
function of x and y is

$$f(x,y \mid \theta) = \frac{L\,(x+y-1)!}{x!\,y!}\,\theta^{x}(1-\theta)^{y},$$

where $x=c$ and $0 \le y \le m-1$ or where $y=m$ and $0 \le x \le c-1$, and $L=m$
if $y=m$, otherwise $L=c$. Wilcox [7] showed that the probability
of a correct decision under the closed sequential procedure
is exactly the same as it is under the binomial
model, but the expected number of items is always less.
For results on estimating $\theta$ under the closed sequential
procedure, see [13]. For extensions to the multivariate
case, including an application to answer-until-correct
tests, see [14, 15].
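
The equivalence between the closed sequential rule and the fixed-length binomial rule can be checked numerically. The sketch below is an illustration under the definitions given above (stop at c correct or m = n-c+1 wrong); it is not taken from [7], and the function names are hypothetical. It evaluates the joint probability function over all stopping points and compares the probability of reaching c correct answers with the binomial probability of at least c correct in n items.

```python
from math import comb, factorial


def stopping_points(c, m):
    """All (x, y) at which sampling can terminate."""
    return [(c, y) for y in range(m)] + [(x, m) for x in range(c)]


def closed_sequential_pmf(x, y, c, m, theta):
    """Joint probability of x correct and y incorrect at termination."""
    if x == c and 0 <= y <= m - 1:
        L = c
    elif y == m and 0 <= x <= c - 1:
        L = m
    else:
        return 0.0
    # L (x+y-1)! / (x! y!) counts the response sequences ending at (x, y).
    paths = L * factorial(x + y - 1) // (factorial(x) * factorial(y))
    return paths * theta ** x * (1.0 - theta) ** y


def pass_probability_sequential(c, m, theta):
    """Probability that sampling stops with c correct answers."""
    return sum(closed_sequential_pmf(c, y, c, m, theta) for y in range(m))


def pass_probability_binomial(c, n, theta):
    """Probability of at least c correct on a fixed n-item test."""
    return sum(
        comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)
        for x in range(c, n + 1)
    )


def expected_items(c, m, theta):
    """Expected number of items administered under the sequential rule."""
    return sum(
        (x + y) * closed_sequential_pmf(x, y, c, m, theta)
        for x, y in stopping_points(c, m)
    )


if __name__ == "__main__":
    c, n, theta = 6, 10, 0.7
    m = n - c + 1
    print(pass_probability_sequential(c, m, theta))
    print(pass_probability_binomial(c, n, theta))
    print(expected_items(c, m, theta), "items on average, versus", n)
```

The two pass probabilities agree because an examinee reaches c correct answers before m wrong ones exactly when at least c of the first n responses would have been correct.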






References 

[1] Griffiths, D. A., (1973), Biometrika, 29, 637-648.

[2] Lord, F. M., (1965), Psychometrika, 30, 239-270.

[3] Lord, F. M., & Novick, M. R., (1968), Statistical
theories of mental test scores, Addison-Wesley,
Reading, Mass. The current classic on mental
test theory.

[4] Morrison, D. G., & Brockway, G., (1979), Psychometrika,
44, 427-442.

[5] Wilcox, R. R., (1980), Applied Psychological Measurement,
4, 425-446.

[6] Wilcox, R. R., (1981a), Journal of Educational Statistics,
6, 3-32. A review of the beta-binomial
model, with an emphasis on mental test theory.

[7] Wilcox, R. R., (1981), British Journal of Mathematical
and Statistical Psychology, 34, 238-242.

[8] Wilcox, R. R., (1982a), British Journal of Mathematical
and Statistical Psychology, 35, 57-70. The only
item sampling model that has given satisfactory
results when dealing with guessing.

[9] Wilcox, R. R., (1982), Journal of Educational Measurement,
19, 67-74.

[10] Wilcox, R. R., (1982), Journal of Experimental Education,
50, 219-222.




[11] Wilcox, R. R., (1983), Applied, Psychological Measure- 
ment , 8, 23$- 240. 

[12] Wilcox, R. R., (1983), Psychometrika, 48, 211-222.

[13] Wilcox, R. R., (1983), Educational and Psychological
Measurement, in press.

[14] Wilcox, R. R., (1982), British Journal of Mathematical
and Statistical Psychology, 35, 193-207.

[15] Wilcox, R. R., (1982), Educational and Psychological
Measurement, 42, 789-794.

[16] Wilcox, R. R., (1979), Psychometrika, 44, 55-68.
The paper considers the problem of determining
whether an examinee has a true score greater
than or less than $\theta_0$, an unknown parameter that
characterizes a control group.



APPROXIMATING MULTIVARIATE DISTRIBUTIONS 






Rand R. Wilcox
Department of Psychology
University of Southern California
and
Center for the Study of Evaluation
University of California, Los Angeles






ABSTRACT

A simple approximation of a multivariate distribution is suggested
that may be useful in certain situations. Comparisons with several
other approximations suggest that the new approximation nearly always
gives better results. In some cases the improvement is minimal, but
for some situations substantially better results are obtained.



Let $X_1,\ldots,X_k$ be k random variables with joint density $f(x_1,\ldots,x_k)$,
and let

$$P = \Pr(X_1<h_1,\ldots,X_k<h_k). \tag{1.1}$$

Of course in many situations (1.1) must be evaluated, but frequently
approximations are poor. In some cases P can be evaluated exactly using
quadrature techniques, but this can be prohibitively expensive, and the
necessary computer programming does not always exist. The goal in this
paper is to suggest a simple approximation of P that appears to be useful
in various situations, and which appears to compare favorably to some
other approximations that have been used in the literature.

The proposed approximation is based on a second order Bahadur approximation
of a multinomial distribution. The motivation for this approximation
stems from a recent investigation (Wilcox, 1983) which included,
among other things, an approximation of $\Pr(\sum y_i > m)$, where the
$y_i$'s are binary random variables. A second order Bahadur approximation
proved to be more accurate than expected, and this led to the approximation
and comparisons made here. Another motivation for this approximation
stems from results reported by McFadden (1955) where it was suggested
that a special case of the approximation used here will frequently give
good results for k=4.

In section 3 the accuracy of the approximation is investigated by
applying it to some distributions where P is known exactly for certain
special cases. The results suggest that the approximation nearly always
improves upon all four approximations of the multivariate t distribution
proposed by Dunnett and Sobel (1955). In a few instances the improvement
is substantial. It also appears to improve upon an approximation of the
multivariate normal distribution proposed by Olkin, Sobel, and Tong (1976).
Finally, the approximation is compared to some percentage points tabled
by Dudewicz and Dalal (1975), and found to give good results in most cases
as long as k is not too large. Compared to the Bonferroni inequality,
there are again situations where there is considerable improvement.

2. The Approximation 

Let $y = (y_1,\ldots,y_k)$ be a random vector, where $y_i = 0$ or 1 ($i=1,\ldots,k$),
and let $p(y_1,\ldots,y_k)$ be the corresponding probability function. Bahadur
(1961) showed that $p(y_1,\ldots,y_k)$ could be written as

$$p(y) = p_1(y)\,g(y)$$

where

$$p_1(y) = \prod_{i=1}^{k} \alpha_i^{y_i}(1-\alpha_i)^{1-y_i},$$

$$\alpha_i = E(y_i),$$

$$g(y) = 1 + \sum_{i<j} r_{ij} z_i z_j + \sum_{i<j<m} r_{ijm} z_i z_j z_m + \cdots + r_{12\ldots k}\, z_1 z_2 \cdots z_k,$$

$$z_i = (y_i-\alpha_i)/[\alpha_i(1-\alpha_i)]^{1/2},$$

$$r_{ij} = E(z_i z_j),$$

$$r_{ijm} = E(z_i z_j z_m),$$

$$\vdots$$

$$r_{12\ldots k} = E(z_1 z_2 \cdots z_k).$$



An mth order approximation of p(y) is obtained by retaining the first
m terms in the expression for g(y). In particular, a second order
approximation is

$$p_1(y)\left[1 + \sum_{i<j} r_{ij} z_i z_j\right]. \tag{2.1}$$

Define

$$y_i = \begin{cases} 1, & \text{if } X_i<h_i \\ 0, & \text{otherwise.} \end{cases} \tag{2.2}$$

Then an approximation of P is just

$$\left[\prod_{i=1}^{k} \Pr(X_i<h_i)\right]\left[1 + \sum_{i<j} \frac{\Pr(X_i<h_i,\, X_j<h_j) - \Pr(X_i<h_i)\Pr(X_j<h_j)}{\Pr(X_i<h_i)\Pr(X_j<h_j)}\right]. \tag{2.3}$$

In many practical situations $\Pr(X_i<h_i)$ ($i=1,\ldots,k$) have a common value V,
and $\Pr(X_i<h_i,\, X_j<h_j)$ have a common value U for all $i<j$, in which case
(2.3) becomes

$$V^{k}\left[1 + \frac{k(k-1)}{2}\,\frac{U-V^{2}}{V^{2}}\right]. \tag{2.4}$$

Bahadur (1961) noted that the approximation (2.1) will be a probability
function if $1+\sum_{i<j} r_{ij} z_i z_j \ge 0$, but that otherwise some of its values will
be negative. This problem never arose in the cases considered here.





3.1 The Multivariate t Distribution

Suppose the joint probability density function of $X_1,\ldots,X_k$ is
multivariate normal with correlation matrix $\{\rho_{ij}\}$, mean vector 0 and common
variance $\sigma^2$, and $\nu S^2/\sigma^2$ has a chi-square distribution independent of the $X_i$'s,
with $\nu$ degrees of freedom. Then the joint density of $T_i = X_i/S$ ($i=1,\ldots,k$) is
multivariate t, and the joint pdf (probability density function) is

$$f(t_1,\ldots,t_k) = \frac{\Gamma\{(\nu+k)/2\}\,A^{1/2}}{(\nu\pi)^{k/2}\,\Gamma(\nu/2)} \left[1 + \frac{1}{\nu}\sum_{i}\sum_{j} a_{ij}\, t_i t_j\right]^{-(\nu+k)/2} \tag{3.1}$$

where A is the determinant of the positive definite matrix $\{a_{ij}\} = \{\rho_{ij}\}^{-1}$.

This distribution arises in ranking and selection (Bechhofer, Dunnett
and Sobel, 1954) where the goal is to determine which of k+1 normal
distributions has the largest mean. Another application was discussed by
Dunnett (1955) where the goal is to compare the mean of k normal
distributions to a control. (See, also, Gupta and Sobel, 1958.) Krishnaiah
(1965) used the distribution to make multiple comparisons in the
multivariate analysis of variance. Properties of this distribution are
summarized by Johnson and Kotz (1972) and Gupta (1963).

For k=2 exact expressions for (3.1) are available (Dunnett and Sobel,
1954), but for k>2 approximations must be used except for certain special
cases where exact results have been tabulated. An approximation was
suggested by John (1961), but unfortunately it is complicated, and some
quadrature is required. Four approximations (lower bounds) were proposed


by Dunnett and So.bel-(1955). These' were ; 

1 -I PrCT^h.) (3.2) 
k 

n PrCT^h/} (3.3) 
i=l ' '" ' 

k/2 , •*••*•' 

n ; Pr CT 21 _ 1 <h 2 ._ ls Pr(T 2 ^<h 2i ) , k even " ; (3.4a) 

(k-l)/2 

PrCT^hj) n Pr ^ T 2i <h 2i * ^i+l^i+l^' k odd (3.4b) 

(Pr (T 1 <h;, T 2 <h)} k / 2 . (3.5) 

The last lower bound assumes h 1 =h 2 =...=h k =h, say, h>0, and j=x -Xj 
for sdme constants A. where Ckx,<l (i=l,. ..,k). Expression (3.4) also ^ 
assumes that p ^ j=X^Xj and that h..>0 (i=l,. .. ,k). For p|j=P» Tong (1970) 
gives the lower bound 

J. 

but he shows that this bound is not as sharp as (3.5).
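The simpler Dunnett and Sobel bounds need only marginal or pairwise probabilities, so they are easy to compute once those are available. The sketch below illustrates (3.2), (3.3) and (3.5); the function names are hypothetical, and reporting a negative value of (3.2) as 0 is an assumption that mirrors the 0 entries in the tables.

```python
from typing import Sequence


def bonferroni_bound(marginals: Sequence[float]) -> float:
    """(3.2): 1 - sum_i Pr(T_i > h_i), reported as 0 when negative.

    `marginals` holds the probabilities Pr(T_i < h_i).
    """
    return max(0.0, 1.0 - sum(1.0 - p for p in marginals))


def product_bound(marginals: Sequence[float]) -> float:
    """(3.3): product of the marginal probabilities Pr(T_i < h_i)."""
    result = 1.0
    for p in marginals:
        result *= p
    return result


def pairwise_power_bound(pair_prob: float, k: int) -> float:
    """(3.5): {Pr(T_1 < h, T_2 < h)}**(k/2) for a common h > 0."""
    return pair_prob ** (k / 2.0)


if __name__ == "__main__":
    marginals = [0.875] * 3
    print(bonferroni_bound(marginals))
    print(product_bound(marginals))
    print(pairwise_power_bound(0.804, 3))
```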

Dunnett and Sobel compared their lower bounds to the actual P values for the important special case $\rho = 1/2$ and where the $h_i$'s have a common value, h. For $\nu = \infty$ and (1.1) close to one, their comparisons suggest that the lower bounds are reasonably accurate for k=3, but for k=9 the accuracy diminishes considerably. They also examined the case $\nu$=5. For k=3 the approximations were tolerable, but for k=9 the approximations were poor. The approximation (3.5) consistently gave the most accurate results.

Table 1 shows the exact value of h so that P=.99, .95, .75, .50 for k=3,9. These values were taken from Dunnett and Sobel (1955). Included in the table are the values of h determined with (3.2), (3.5) and (2.4). As can be seen, (2.4) nearly always improves upon both (3.2) and (3.5) without making any assumptions about the structure of the correlation matrix $\{\rho_{ij}\}$. For P close to one there is little improvement over the other approximations, primarily because (3.2) and (3.4) give fairly accurate results. As P decreases, though, (2.4) begins to give reasonably more accurate results.

Table 2 shows the approximation of P for $\nu$=5, k=3,9 and various values of h. Again (2.4) nearly always improves upon (3.2) and (3.5), but unfortunately all three approximations are poor for k=9 unless P is close to one. Also observe that (2.4) is substantially more accurate for k=3 and P=.5.

3.2 Approximating a Distribution Occurring in Ranking and Selection

Let $T_i$ (i=1,...,k+1) be k+1 independent random variables all having a Student's t distribution with $\nu$ degrees of freedom, and let $W_i = T_i - T_{k+1}$ (i=1,...,k). The joint distribution of the $W_i$'s arises in the ranking and selection problem considered by Dudewicz and Dalal (1975). Table 3 shows the exact value of P (which was taken from the table in Dudewicz and Dalal) and the value of (2.4) for k=3,5; h=1,2,4 and $\nu$=1,14,29. The value of (2.4) was determined using the table in Dudewicz and Dalal. As can be seen, the approximation does not always work well when $\nu$=1, but otherwise it gives reasonably accurate results. Table 3 also includes an approximation based on the Bonferroni inequality, $P \ge 1 - \sum_i \Pr(W_i > h_i)$, but as is evident (2.4) gives better results and in most cases the improvement is substantial.
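For $\nu$=1 the Bonferroni approximation can be checked in closed form, because Student's t with one degree of freedom is the standard Cauchy distribution and the difference of two independent standard Cauchy variables is Cauchy with scale 2. The sketch below relies on that fact; the function names are hypothetical, and reporting a negative bound as 0 is an assumption that follows the 0 entries in Table 3.

```python
import math


def cauchy_difference_tail(h: float) -> float:
    """Pr(W > h) for W = T_1 - T_2 with T_1, T_2 independent standard Cauchy.

    W is Cauchy with scale 2, so the upper tail is 1/2 - atan(h/2)/pi.
    """
    return 0.5 - math.atan(h / 2.0) / math.pi


def bonferroni_lower_bound(h: float, k: int) -> float:
    """1 - k * Pr(W > h) for a common h, reported as 0 when negative."""
    return max(0.0, 1.0 - k * cauchy_difference_tail(h))


if __name__ == "__main__":
    for k in (3, 5):
        for h in (1, 2, 4):
            print(k, h, round(bonferroni_lower_bound(h, k), 3))
```

The nonzero values produced this way agree to three decimals with the $\nu$=1 Bonferroni entries in Table 3.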






3.3 Estimating the Probability of a Correct Selection in Ranking and Selection

For the final comparison, let $X_1,\ldots,X_{k+1}$ be k+1 independent standard normal random variables. Estimating the probability of a correct selection in ranking and selection problems requires evaluating

** ■ 

Evaluating (3.6) also plays a central role in Tong (1978). 

Olkin, Sobel and Tong (1976) suggest a family of approximations of
(3.6) that are based on majorization. To illustrate the accuracy of their
approximation they consider k=5, 1^=3.2,' h^2\I 9 h 3 *2.5, h 4 =1.9, h 5 =1.7. 

The exact value of (3.6) is .8016. The closest approximation (in absolute value) based on their approach is .7802. If instead (2.4) is used, we get .8171. Obviously this one case is not a compelling reason to abandon the approximation proposed by Olkin, Sobel and Tong. It is difficult to make extensive comparisons because the quantity approximated by Olkin et al. is generally unknown. The point is that we have one more example where (2.4) gives good results.

Henery (1981) suggests another approximation of (3.6), which we compared to some of the exact values in Bechhofer (1954). For k=3 it worked reasonably well for P<.8, but unfortunately for P>.8 it gave very poor results and so it was not considered further (cf. Sathe & Lingras, 1980; Rice et al., 1979).




4. Summary and Concluding Remarks

In some instances the approximation (2.3) will give very accurate results, but as was illustrated this is not always the case. However, it seems to give a reasonable approximation in most situations when k is not too large. Moreover, it is easy to use when the exact distribution is known for k=2, and so it may be useful in certain situations. More importantly, (2.3) appears to compare favorably to various approximations that have been proposed in the past, and it can give considerably better results when P is not too close to one. It is interesting to note that the Bonferroni inequality is known to usually give accurate results when P is close to one; (2.4) generally gives an even better approximation in these cases, but the improvement is not overly striking.

For distributions related to Student's t distribution, the comparisons made in Tables 1 and 2 suggest that (2.4) works tolerably well for k=5 and $\nu$, the degrees of freedom, as small as 14. For k=3, (2.4) seems to even work reasonably well for $\nu$=5. However, for k=9, all of the approximations considered here appear to be highly inaccurate except for a few cases where P is close to one.

Finally, no analytic results were given on the accuracy of (2.3), but the only analytic result concerning the other approximations is that they provide bounds for P. In some instances these bounds can be extremely inaccurate, in which case (2.3) might be considered. In fact, in terms of obtaining accurate approximations, the only motivation for preferring existing bounds is that they were invented first.



TABLE 1

Comparisons for the Multivariate Normal Case of Exact and
Approximate Percentage Points, h, for Selected Values of P

                    k=3                             k=9
  P     (3.2)  (3.4)  (2.4)  Exact      (3.2)  (3.4)  (2.4)  Exact
 .99    2.71   2.70   2.?9   2.68       3.06   3.05   2.97   3.00
 .95    2.13   2.09   2.0?   2.06       2.54   2.51   2.29   2.42
 .75    1.38   1.26   1.16   1.19              1.82   1.30   1.60
 .50     .97    .70    .56    .59       1.59   1.38    .85   1.04

TABLE 2

Comparisons for the Multivariate t of Approximate and
Exact P Values for Selected Values of h, ν=5

  k      h     (3.2)   (3.5)   (2.4)   Exact
  3    4.21    .987    .989    .990     .99
  3    2.69    .931    .944    .954     .95
  3    1.32    .625    .721    .770     .75
  3     .62    .139    .445    .515     .50
  9    5.03    .987    .989    .998     .99
  9    3.30    .903    .913    .997     .95
  9    1.81    .415    .597    .944     .75
  9    1.10    0       .269    .655     .50



TABLE 3

Approximation of Values Tabulated by
Dudewicz and Dalal

                      k=3                            k=5
  ν    h    Bonferroni  (2.4)   Exact     Bonferroni  (2.4)   Exact
  1    1       0        .422    .402         0        .325    .285
 14    1      .249      .552    .537         0        .483    .431
 29    1      .265      .559    .545         0        .491    .440
  1    2      .250      .572    .541         0        .519    .414
 14    2      .724      .806    .798        .540      .776    .726
 29    2      .745      .818    .811        .575      .788    .743
  1    4      .557      .743    .711        .262      .750    .605
 14    4      .983      .985    .981        .971      .980    .977
 29    4      .989      .990    .990        .981      .986    .984



References

Bahadur, R. R. A representation of the joint distribution of responses to n dichotomous items. In H. Solomon (Ed.), Studies in item analysis and prediction. Stanford: Stanford University Press, 1961.

Bechhofer, R. E. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 1954, 25, 16-39.

Bechhofer, R. E., Dunnett, C. W., & Sobel, M. A two-sample multiple decision procedure for ranking means of normal populations with a common unknown variance. Biometrika, 1954, 41, 170-176.

Dudewicz, E. J., & Dalal, S. R. Allocation of observations in ranking and selection with unequal variances. Sankhya, 1975, Series B, 37, 28-78.

Dunnett, C. W. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 1955, 50, 1096-1121.

Dunnett, C. W., & Sobel, M. A bivariate generalization of Student's t distribution, with tables for certain special cases. Biometrika, 1954, 41, 153-169.

Dunnett, C. W., & Sobel, M. Approximations to the probability integral and certain percentage points of a multivariate analogue of Student's t-distribution. Biometrika, 1955, 42, 258-260.

Gupta, S. S. Probability integrals of multivariate normal and multivariate t. Annals of Mathematical Statistics, 1963, 34, 792-828.

Gupta, S., & Sobel, M. On selecting a subset which contains all populations better than a standard. Annals of Mathematical Statistics, 1958, 29, 235-244.

Henery, R. J. Permutation probabilities as models for horse races. Journal of the Royal Statistical Society, 1981, Series B, 43, 86-91.

John, S. On the evaluation of the probability integral of the multivariate t. Biometrika, 1961, 48, 409-417.

Johnson, N. L., & Kotz, S. Distributions in Statistics: Continuous Multivariate Distributions. New York: Wiley, 1972.

Krishnaiah, P. R. Multiple comparison tests in multi-response experiments. Sankhya, 1965, Series A, 27, 65-72.

McFadden, J. A. Urn models of correlation and a comparison with the multivariate normal integral. Annals of Mathematical Statistics, 1955, 26, 478-489.

Olkin, I., Sobel, M., & Tong, Y. L. Estimating the true probability of correct selection for location and scale parameter families. Department of Statistics, Stanford, Technical Report No. 110, 1976.

Rice, J., Reich, T., & Cloninger, C. R. An approximation to the multivariate normal integral: Its application to multifactorial qualitative traits. Biometrics, 1979, 35, 451-469.

Sathe, Y. H., & Lingras, S. R. A note on the inequalities for tail probability of the multivariate normal distribution. Communications in Statistics: Theory and Methods, 1980, A9, 711-715.

Tong, Y. L. Some probability inequalities of multivariate normal and multivariate t. Journal of the American Statistical Association, 1970, 65, 1243-1247.

Tong, Y. L. An adaptive solution to ranking and selection problems. Annals of Statistics, 1978, 6, 658-672.

Wilcox, R. R. An approximation of the k out of n reliability of a test, and a scoring procedure for determining which items an examinee knows. Psychometrika, 1983, 48, 211-222.






UNBIASED ESTIMATION IN A CLOSED SEQUENTIAL
TESTING PROCEDURE



Rand R. Wilcox
Department of Psychology
University of Southern California

and

Center for the Study of Evaluation
University of California, Los Angeles



ABSTRACT

Let p be the proportion of items within an item domain that an examinee would answer correctly if every item were attempted. This brief note provides unbiased estimates of $p^t$, for any integer t, when a closed sequential testing procedure is used.



Consider a single examinee, a domain of items, and let p be the examinee's domain score or true score. That is, p is the proportion of items in the domain of items that the examinee would get correct if every item were attempted. In some cases it is assumed that z, the number correct observed score, has a binomial probability function, and that for the population of examinees the distribution of p belongs to the beta family. This beta-binomial model has been used to solve many measurement problems (Lord, 1965; Lord & Novick, 1968; Wilcox, 1981a).

Let $p_0$ be a known constant, $0 < p_0 < 1$. In criterion-referenced testing a common goal is to determine for every examinee whether $p > p_0$. Usually this is done by administering n items to every examinee and deciding $p > p_0$ if and only if $z/n \ge p_0$. Wilcox (1981b) pointed out that it is possible to improve uniformly on this procedure when computerized testing is feasible. The procedure is based on a closed sequential sampling scheme. This means that items are sampled and administered one at a time until an examinee gets m items right or M items wrong. In Wilcox (1981b) m was set equal to the smallest integer z such that $z/n \ge p_0$, and then M was set equal to n-m+1.

The purpose of this brief note is to provide unbiased estimates of $p^t$ for any integer t, $1 \le t < m$. It is noted that, for t=1, an unbiased estimate is easily derived from results in Girshick et al. (1946).

After sampling terminates, let x be the number of items the examinee answers correctly, and let y be the number for which an incorrect response is given. The unbiased estimate of $p^t$ is



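$$\hat{p}^{\,t} = \begin{cases} \dbinom{m-t-1+y}{m-t-1} \Big/ \dbinom{m-1+y}{m-1}, & \text{if } x = m \\[2ex] \dbinom{M+x-t-1}{x-t} \Big/ \dbinom{M-1+x}{M-1}, & \text{if } y = M, \end{cases}$$

where

$$\binom{M+x-t-1}{x-t} = 0, \quad \text{if } x < t.$$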



To establish the above result, first it is noted that from Wilcox (1981b), the joint probability function of x and y is

$$f(x,y \mid p) = \begin{cases} \dbinom{m-1+y}{m-1}\, p^m (1-p)^y, & \text{if } x = m \\[2ex] \dbinom{M-1+x}{M-1}\, p^x (1-p)^M, & \text{if } y = M. \end{cases}$$

Proceeding as is done for the binomial case, it follows that $E(\hat{p}^{\,t}) = p^t$. Henceforth, $\hat{p}^{\,t}$ will be written as $\hat{p}$ when t=1. The maximum likelihood estimate of p is $\hat{p}_m = x/(x+y)$. To gain some insight into how $\hat{p}$ and $\hat{p}_m$ compare, selected values of $E(\hat{p}-p)^2$ and $E(\hat{p}_m-p)^2$ were computed, and the results are reported in Table 1. As can be seen, $\hat{p}$ generally gives more accurate results than $\hat{p}_m$.
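Because the stopping rule has only m+M terminal points, the expectations above can be evaluated by direct enumeration. The sketch below assumes the joint probability function and the estimate take the binomial-coefficient forms $\binom{m-1+y}{m-1}p^m(1-p)^y$ (when x=m) and $\binom{M-1+x}{M-1}p^x(1-p)^M$ (when y=M), with the unbiased estimate formed as a ratio of such coefficients; all function names are hypothetical.

```python
from math import comb


def outcomes(m, M):
    """Terminal points (x, y): x = m with y < M, or y = M with x < m."""
    return [(m, y) for y in range(M)] + [(x, M) for x in range(m)]


def prob(x, y, p, m, M):
    """Joint probability function of (x, y) at a terminal point."""
    if x == m:
        return comb(m - 1 + y, m - 1) * p ** m * (1 - p) ** y
    return comb(M - 1 + x, M - 1) * p ** x * (1 - p) ** M


def unbiased_estimate(x, y, t, m, M):
    """Unbiased estimate of p**t for an integer t with 1 <= t < m."""
    if x == m:
        return comb(m - t - 1 + y, m - t - 1) / comb(m - 1 + y, m - 1)
    if x < t:
        return 0.0
    return comb(M + x - t - 1, x - t) / comb(M - 1 + x, M - 1)


def expectation(func, p, m, M):
    """Expected value of func(x, y) over all terminal points."""
    return sum(func(x, y) * prob(x, y, p, m, M) for x, y in outcomes(m, M))


def mse_unbiased(p, m, M):
    """E(p_hat - p)**2 for the unbiased estimate with t = 1."""
    return expectation(lambda x, y: (unbiased_estimate(x, y, 1, m, M) - p) ** 2, p, m, M)


def mse_mle(p, m, M):
    """E(p_hat_m - p)**2 for the maximum likelihood estimate x/(x+y)."""
    return expectation(lambda x, y: (x / (x + y) - p) ** 2, p, m, M)


if __name__ == "__main__":
    print(mse_unbiased(0.1, 5, 5), mse_mle(0.1, 5, 5))
```

Calling `expectation` with `unbiased_estimate` for a given t gives a direct numerical check that the estimate has expectation $p^t$.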

- t 

Two situations are briefly noted where unbiased estimates of $p^t$ are important. The first is estimating the true score distribution. Suppose that for the population of examinees, p has a beta density given by

$$g(p) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, p^{r-1}(1-p)^{s-1}, \tag{1}$$

where r,s>0 are unknown parameters. To estimate r and s, let $\hat{p}_{1i}$ and $\hat{p}_{2i}$ be the unbiased estimates of $p_i$ and $p_i^2$, respectively, for the ith randomly sampled examinee, i=1,...,N. Proceeding as in Griffin and Krutchkoff (1971), it follows that

$$N^{-1} \sum_{i=1}^{N} \hat{p}_{ti}$$

can be used to estimate $E(p^t)$, where the expectation is taken with respect to the beta density. Thus, r and s can be estimated as described in Wilcox (1981a).
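One simple way to turn estimates of the first two moments into values of r and s is the method of moments, sketched below. This is an assumption made for illustration: the procedure in Wilcox (1981a) is not reproduced here and may differ, and the function names are hypothetical. The sketch uses $E(p) = r/(r+s)$ and $E(p^2) = r(r+1)/[(r+s)(r+s+1)]$.

```python
def beta_parameters_from_moments(mu1: float, mu2: float):
    """Solve E(p) = mu1 and E(p**2) = mu2 for the beta parameters (r, s)."""
    variance = mu2 - mu1 ** 2
    if not (0.0 < mu1 < 1.0) or variance <= 0.0 or mu2 >= mu1:
        raise ValueError("moments are not consistent with a beta density")
    total = (mu1 - mu2) / variance  # equals r + s
    return mu1 * total, (1.0 - mu1) * total


def estimate_beta_parameters(p_hat, p2_hat):
    """Average per-examinee unbiased estimates of p and p**2, then match moments."""
    n = len(p_hat)
    return beta_parameters_from_moments(sum(p_hat) / n, sum(p2_hat) / n)


if __name__ == "__main__":
    # A beta density with r = 2, s = 3 has E(p) = .4 and E(p**2) = .2.
    print(beta_parameters_from_moments(0.4, 0.2))
```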

The second illustration has to do with the optimal linear estimator of p. Because $\hat{p}$ is unbiased, the linear estimate, $\tilde{p}$, that minimizes $E_p E(\tilde{p}-p)^2$ is given by $\tilde{p} = (\sigma_p^2/\sigma^2)(\hat{p}-\mu_1)+\mu_1$, where $\sigma^2$ is the variance of the marginal distribution of $\hat{p}$, $\sigma_p^2 = \mu_2 - \mu_1^2$ and $\mu_t = E(p^t)$ (Griffin & Krutchkoff, 1971). From the results given above $\sigma_p^2$ and $\mu_1$ can be estimated, yielding an estimate of $\tilde{p}$ (cf. Wilcox, 1978).
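The linear estimate shrinks the unbiased estimate toward the population mean by the ratio of the true-score variance to the marginal variance. A minimal sketch follows, assuming the form $(\sigma_p^2/\sigma^2)(\hat{p}-\mu_1)+\mu_1$ with $\sigma_p^2 = \mu_2-\mu_1^2$; the function name and argument names are hypothetical, and in practice the moments and the marginal variance would be replaced by estimates.

```python
def optimal_linear_estimate(p_hat: float, mu1: float, mu2: float, sigma2: float) -> float:
    """Shrink p_hat toward mu1 with weight (mu2 - mu1**2) / sigma2.

    mu1 and mu2 are the first two moments of the true-score distribution,
    and sigma2 is the variance of the marginal distribution of p_hat.
    """
    if sigma2 <= 0.0:
        raise ValueError("sigma2 must be positive")
    weight = (mu2 - mu1 ** 2) / sigma2
    return weight * (p_hat - mu1) + mu1


if __name__ == "__main__":
    # True-score variance .04 and marginal variance .08 give a weight of one half.
    print(optimal_linear_estimate(0.8, 0.4, 0.2, 0.08))
```

When the marginal variance is much larger than the true-score variance, the weight is small and the estimate stays close to the population mean.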







REFERENCES

Girshick, M. A., Mosteller, F., & Savage, L. J. Unbiased estimates for certain binomial sampling problems with applications. Annals of Mathematical Statistics, 1946, 17, 13-23.

Griffin, B. S., & Krutchkoff, R. G. Optimal linear estimators: An empirical Bayes version with application to the binomial distribution. Biometrika, 1971, 58, 195-201.

Lord, F. M. A strong true-score theory, with applications. Psychometrika, 1965, 30, 239-270.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley, 1968.

Wilcox, R. R. Estimating true score in the compound binomial error model. Psychometrika, 1978, 43, 245-258.

Wilcox, R. R. A review of the beta-binomial model and its extensions. Journal of Educational Statistics, 1981, 6, 3-32. (a)

Wilcox, R. R. A closed sequential procedure for comparing the binomial distribution to a standard. British Journal of Mathematical and Statistical Psychology, 1981, 34, 238-242. (b)




TABLE 1*

VALUE OF $E(\hat{p}-p)^2$ AND $E(\hat{p}_m-p)^2$

                               p
  m    M      .1      .2      .3      .4      .5
  5    5    .0158   .0276   .0338   .0368   .0376
            .0126   .0236   .0350   .0448   .0489

  5   10    .0083   .0143   .0198   .0256   .0307
            .0076   .0170   .0271   .0326   .0334

  5   15    .0056   .0109   .0175   .0246   .0304
            .0059   .0154   .0236   .0282   .0308

  5   20    .0044   .0098   .0171   .0245   .0304
            .0053   .0144   .0216   .0270   .030?

 10   10    .0083   .0133   .0157   .0162   .0162
            .0071   .0120   .0155   .0190   .0208

 10   15    .0055   .0088   .0106   .0122   .0141
            .0050   .0084   .0122   .0157   .0163

*The first entry in every cell is $E(\hat{p}-p)^2$, and the second entry is $E(\hat{p}_m-p)^2$.



